
1 Introduction

Passwords and PINs are prevalent user authentication techniques primarily because they are easy to implement, require no special hardware, and users tend to understand them well [11]. However, one of their inherent disadvantages is susceptibility to shoulder surfing attacks [23], of which there are two main types: (1) input-based and (2) output-based. The former is more common; in it, the adversary observes an input device (keyboard or keypad) as the user enters a secret (password or PIN) and learns the key-presses. The latter involves the adversary observing an output device (screen or projector) while the user enters a secret which is displayed in cleartext. The principal distinction between the two types is the adversary’s proximity: observing input devices requires the adversary to be closer to the victim than observing output devices, which tend to have larger form factors, i.e., physical dimensions.

Completely disabling on-screen feedback during secret entry (as in, e.g., the Unix sudo command) mitigates output-based shoulder-surfing attacks. Unfortunately, it also impacts usability: when deprived of visual feedback, users cannot determine whether a given key-press was registered and are thus more apt to make mistakes. In order to balance security and usability, user interfaces typically implement password masking by displaying a generic symbol (e.g., “\(\bullet \)” or “\(*\)”) after each keystroke. This technique is commonly used on desktops, laptops and smartphones, as well as on public devices such as Automated Teller Machines (ATMs) or Point-of-Sale (PoS) terminals at shops or gas stations.

Despite the popularity of password masking, little has been done to quantify how visual keystroke feedback impacts security. In particular, masking assumes that showing generic symbols does not reveal any information about the corresponding secret. This assumption seems reasonable, since visual representation of a generic symbol is independent of the key-press. However, in this paper we show that this assumption is incorrect. By leveraging precise inter-keystroke timing information leaked by the appearance of each masking symbol, we show that the adversary can significantly narrow down the user secret’s search space. Put another way, the number of attempts required to brute-force a secret decreases appreciably when the adversary has access to inter-keystroke timing information.

There are many realistic settings where visual inter-keystroke timing information (leaked via appearance of masking symbols) is readily available while the input information is not, i.e., the input device is not easily observable. For example, in a typical lecture or classroom scenario, the presenter’s keyboard is usually out of sight, while the external projector display is wide-open for recording. Similarly, in a multi-person office scenario, an adversarial co-worker can surreptitiously record the victim’s screen. The same holds in public scenarios, such as PoS terminals and ATMs, where displays (though smallish) tend to be easier to observe and record than entry keypads.

In this paper we consider two representative scenarios: (1) a presenter enters a password into a computer connected to an external projector; (2) a user enters a PIN at an ATM in a public location. The adversary is assumed to record keystroke feedback from the projector display or an ATM screen using a dedicated video camera or a smartphone. We note that a human adversary does not need to be present during the attack: recording might be done via an existing camera either pre-installed or pre-compromised by the adversary, possibly remotely, e.g., as in the infamous Mirai botnet [14].

Contributions. The main goal of this paper is to quantify the amount of information leaked through video recordings of on-screen keystroke feedback. To this end, we conducted extensive data collection experiments that involved 84 subjects. Each subject was asked to type passwords or PINs while the screen or projector was video-recorded using either a commodity video camera or a smartphone camera. Based on this, we determined the key statistical properties of the resulting data, and set up an attack, called SILK-TV: Secret Information Leakage from Keystroke Timing Videos. It allows us to quantify the reduction in brute-force search space due to timing information. SILK-TV leverages multiple publicly available typing datasets to extract population timings, and applies this information to inter-keystroke timings extracted from videos.

Our results show that video recordings can be effective in extracting precise inter-keystroke timing information. Experiments show that SILK-TV substantially reduces the search space for each password, even when the adversary has no access to user-specific keystroke templates. When run on passwords, SILK-TV performed better than random guessing between 87% and 100% of the time, depending on the password and the machine learning technique used to instantiate the attack. The resulting average speedup is between 25% and 385% (depending on the password), compared to random dictionary-based guessing; some passwords were correctly guessed in as few as 68 attempts. A single password timing disclosure is enough for SILK-TV to successfully achieve these results. However, when the adversary observes the user entering the password three times, SILK-TV can crack the password in as few as 19 attempts. Clearly, SILK-TV’s benefits depend in part on the strength of a specific password. With very common passwords, benefits of SILK-TV are limited. Meanwhile, we show that SILK-TV substantially outperforms random guessing with less common passwords. With PINs, disclosure of timing poses only a minimal risk – SILK-TV reduced the number of guessing attempts by a mere 3.8%, on average.

Paper Organization. Section 2 overviews state-of-the-art in password guessing based on timing attacks. Section 3 presents SILK-TV and the adversary model. Section 4 discusses our data collection and experiments. We then present the results on password guessing using SILK-TV in Sect. 5, and on PIN guessing in Sect. 6. The paper concludes with the summary and future work directions in Sect. 7.

2 Related Work

There is a large body of prior work on timing attacks in the context of keyboard-based password entry. Song et al. [21] demonstrated a weakness that allows the adversary to extract information about passwords typed during SSH sessions. The attack relies on the fact that, to minimize latency, SSH transmits each keystroke immediately after entry, in a separate IP packet. By eavesdropping on such packets, the adversary can collect accurate inter-keystroke timing information. The authors of [21] showed that this information can be used to restrict the search space of passwords. The impact of this work is significant, because it demonstrates the power of timing attacks for password cracking.

There are several studies of keystroke inference from analysis of video recordings. Balzarotti et al. [4] addressed the typical shoulder-surfing scenario, where a camera tracks hand and finger movements on the keyboard. Text was automatically reconstructed from resulting videos. Similarly, Xu et al. [30] recorded user’s finger movements on mobile devices to infer keystroke information. Unfortunately, neither attack applies to our sample scenarios, where the keyboard is invisible to the adversary.

Shukla et al. [20] showed that text can be inferred even from videos where the keyboard/keypad is not visible. This attack involved analyzing video recordings of the back of the user’s hand holding a smartphone in order to infer which location on the screen is tapped. By observing the motion of the user’s hand, the path of the finger across the screen can be reconstructed, which yields the typed text. In a similar attack, Sun et al. [22] successfully reconstructed text typed on tablets by recording and analyzing the tablet’s movements, rather than movements of the user’s hands.

Another line of work aimed to quantify keystroke information inadvertently leaked by motion sensors. Owusu et al. [16] studied this in the context of a smartphone’s inertial sensors while the user types using the on-screen keyboard. The application used to implement this attack does not require special privileges, since modern smartphone operating systems do not require explicit authorization to access inertial sensor data. Similarly, Wang et al. [27] explored keystroke information leakage from inertial sensors on wearable devices, e.g., smartwatches and fitness trackers. By estimating the motion of a wearable device placed on the wrist of the user, movements of the user’s hand over a keyboard can be inferred. This allows learning which keys were pressed along the hand’s path. Compared to our work, both [16, 27] require a substantially higher level of access to the user’s device. To collect data from inertial sensors, the adversary must have previously succeeded in deceiving the user into installing a malicious application, or otherwise compromised the user’s device. In contrast, SILK-TV is a fully passive attack.

Acoustic emanations represent another effective side-channel for keystroke inference. This class of attacks is based on the observation that different keyboard keys emit subtly different sounds when pressed. This information can be captured (1) locally, using microphones placed near the keyboard [3, 32], or (2) remotely, via Voice-over-IP [8]. Also, acoustic emanations captured using multiple microphones can be used to extract locations of keys on a keyboard. As shown by Zhou et al. [31], recordings from multiple microphones can be used to accurately quantify time difference of arrival (TDoA), and thus triangulate positions of pressed keys.

3 System and Adversary Model

We now present the system and adversary model used in the rest of the paper.

We model a user logging in (authenticating) to a computer system or an ATM using a PIN or a password (secret) entered via keyboard or keypad (input device). The user receives immediate feedback about each key-press from a screen, a projector, or both (output device) in the form of dots or asterisks (masking symbols). The shape and location of each masking symbol do not depend on which key is pressed. The adversary can observe and record the output device(s), though not the input device or the user’s hands. An example of this scenario is shown in Fig. 1. The adversary’s goal is to learn the user’s secret.

The envisaged attack setting is representative of many real-world scenarios that involve low-privilege adversaries, including: (1) a presenter in a lecture or conference who types a password while the screen is displayed on a projector; the entire audience can see the timing of appearance of masking symbols, and the adversary can be anyone in the audience; (2) an ATM customer typing a PIN; the adversary who stands in line behind the user might have an unobstructed view of the screen, and of the timing of appearance of masking symbols (see Fig. 2); and (3) a customer entering her debit card PIN at a self-service gas-station pump; in this case, the adversary can be anyone in the surroundings with a clear view of the pump’s screen.

Although these scenarios seem to imply that the adversary is located near the user, proximity is not a requirement for our attack. For instance, the adversary could watch a prior recording of the lecture in scenario (1); could monitor the ATM using a CCTV camera in (2); or could remotely view the screen in (3) through a compromised IoT camera.

Also, we assume that, in many cases, the attack involves multiple observations. For example, in scenario (1), the adversary can observe the presenter during multiple talks, without the presenter changing passwords between talks. Similarly, in scenario (2), customers often return to the same ATM.

Fig. 1. Example attack scenario.

Fig. 2. Attack example – ATM setting. (a) Adversary’s perspective. (b) Outsider’s perspective.

4 Overview and Data Collection

Recall that SILK-TV confines the information about the secret that the adversary can capture to inter-keystroke timings leaked by the output device while the user types a secret. The goal is to analyze differences between the distributions of inter-keystroke timings and infer the corresponding keypairs. This data is used to identify the passwords that are most likely to be correct, thus restricting the brute-force search space of the secret. To accurately extract inter-keystroke timing information, we analyze video feeds of masking symbols, and identify the frame where each masking symbol first appears. In this setting, the accuracy and resolution of inter-keystroke timings depend on two key factors: the refresh frequency of the output device, and the frame rate of the video camera. Inter-keystroke timings are then fed to a classifier, where the classes of interest are keypairs. Since we assume that the adversary has no access to user-specific keystroke information, the classifier is trained on population data, rather than on user-specific timings.

In the rest of this section, we detail the data collection process. We collected password data from two types of output devices: a VGA-based external projector, and LCD screens of several laptop computers. See Sect. 4.1 for details of these devices and corresponding procedures. For PIN data, we video-recorded the screen of a simulated ATM. Details can be found in Sect. 4.2.

4.1 Passwords

We collected data using an EPSON EMP-765 projector, and using the LCD screens of the subjects’ laptop computers. In the projector setting, we asked the subjects to connect their own laptops so they would be using a familiar keyboard. The refresh rate of both laptop and projector screens was set to 60 Hz – the default setting for most systems. This setting introduces quantization errors of up to about 1/60 s \(\approx 16.7\) ms. Thus, events happening within the same refresh window of 16.7 ms are indistinguishable. We recorded videos of the screen and the projector using the rear-facing cameras of two smartphones: a Samsung Galaxy S5 and an iPhone 7 Plus. With both phones, we recorded videos at 120 frames per second, i.e., 1 frame every 8.3 ms. To ease data collection, we placed the smartphones on a tripod. When recording the projector, the tripod was placed on a table, filming from a height of about 165 cm, to be horizontally aligned with the projected image. When recording laptop screens, we placed the smartphone above and to the side of the subject, in order to mimic an adversary sitting behind the subject.

All experiments took place indoors, in labs and lecture halls at the authors’ institutions. We recruited a total of 62 subjects, primarily from the student population of two large universities. Most participants were males in their 20s, with a technical background and good typing skills. We briefed each subject on the nature of the experiment, and asked them to type four alphanumeric passwords: “jillie02”, “william1”, “123brian”, and “lamondre”. We selected these passwords uniformly at random from the RockYou dataset [1] in order to simulate realistic passwords. The subjects typed each password three times, while our data collection software recorded ground-truth keystroke timings of correctly typed passwords with millisecond accuracy. Timings from passwords that were typed incorrectly were discarded, and subjects were prompted to re-type the password whenever a mistake was made. The typing procedure lasted between 1 and 2 min, depending on the subject’s typing skills. All subjects typed with the “touch typing” technique, i.e., using fingers from both hands.

4.2 PINs

We recorded subjects entering 4-digit PINs on a simulated ATM, shown in Fig. 3. Our dataset was based on experiments with 22 participants; 19 subjects completed three data collection sessions, while 4 subjects completed only one session, resulting in a total of 61 sessions. At the beginning of each session, the subject was given 45 s to get accustomed to the keypad of the ATM simulator. During this time, they were free to type as they pleased. Next, the subject was shown a PIN on the screen for ten seconds (Fig. 4a), and, once it disappeared from the screen, asked to enter it four times (Fig. 4b). Subjects were advised not to read the PINs out loud. This process was repeated for 15 consecutive PINs. During each session, subjects were presented with the same 15-PIN sequence 3 times. Subjects were given a 30-second break at the end of each sequence.

Fig. 3. Setup used in PIN inference experiments.

Fig. 4. ATM simulator during a data collection session. (a) The simulator displays the next PIN. (b) A subject types the PIN from memory.

Specific 4-digit PINs were selected to test whether: (1) inter-keypress time is proportional to Euclidean Distance between keys on the keypad; and (2) the direction of movement (up, down, left, or right) between consecutive keys in a keypair impacts the corresponding inter-key time. We show an example of these two situations on the ATM keypad in Fig. 5. We chose a set of PINs that allowed collection of a significant number of key combinations appropriate for testing both hypotheses. For instance, PIN 3179 tested horizontal and vertical distance two, while 1112 tested distance 0 and horizontal distance 1.
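To make these two keypad features concrete, the following sketch computes Euclidean distance and direction of movement between consecutive PIN digits. The key coordinates assume the standard ATM layout of Fig. 5 (digits 1–9 in a 3\(\,\times \,\)3 grid, 0 below the 8); the helper names are ours, for illustration only.

```python
# Sketch of the two keypad features examined above: Euclidean distance and
# direction of movement between consecutive keys (assumed standard ATM layout).
import math

KEY_POS = {"1": (0, 0), "2": (1, 0), "3": (2, 0),
           "4": (0, 1), "5": (1, 1), "6": (2, 1),
           "7": (0, 2), "8": (1, 2), "9": (2, 2),
           "0": (1, 3)}

def distance(a, b):
    """Euclidean distance between two keypad keys."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return math.hypot(x2 - x1, y2 - y1)

def direction(a, b):
    """Movement vector between two keys, e.g. (1, 0) = right, (0, 1) = down."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return (x2 - x1, y2 - y1)

# Example: in PIN 3179, the pair 3-1 covers horizontal distance 2 and the
# pair 1-7 covers vertical distance 2, exercising hypothesis (1).
print(distance("3", "1"), direction("3", "1"))  # 2.0 (-2, 0)
```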

Fig. 5. ATM keypad in our experiments. (a) To type keypairs 1–2 and 1–4, the typing finger travels the same distance in different directions. (b) Keypairs 1–2 and 1–3 require the typing finger to travel different distances in the same direction.

Fig. 6. CDF showing error distribution of inter-keystroke timings extracted from videos.

Sessions were recorded using a Sony FDR-AX53 camera, with a resolution of 1,920\(\,\times \,\)1,080 pixels, at 120 frames per second. At the same time, the ATM simulation software collected millisecond-accurate inter-keystroke timing ground truth by logging each keypress. PIN feedback was shown on a DELL \(17''\) LCD screen with a refresh rate of 60 Hz, which resulted in each frame being displayed for 16.7 ms.

4.3 Timing Extraction from Video

We developed software that analyzes video recordings to automatically detect the appearance of masking symbols and log the corresponding timestamps. This software uses OpenCV [17] to infer the number of symbols present in each image. All frames are first converted to grayscale, and then processed through a bilateral filter [25] to reduce noise due to the camera’s sensor. The resulting images are analyzed using Canny edge detection [9] to capture the edges of the masking symbols. External contours are compared with the expected shape of the masking symbol. When a new masking symbol is detected, the software logs the corresponding frame number.
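A minimal sketch of this detection pipeline is shown below, assuming OpenCV-Python (4.x). The region of interest `roi` (the password field) and the reference contour `template_contour` of the masking symbol are hypothetical inputs that an attacker would calibrate manually; the specific filter and threshold values are illustrative.

```python
# Sketch: count masking symbols per frame and log frames where a new one appears.
import cv2

def count_symbols(frame, template_contour, roi):
    x, y, w, h = roi
    gray = cv2.cvtColor(frame[y:y+h, x:x+w], cv2.COLOR_BGR2GRAY)
    smooth = cv2.bilateralFilter(gray, 9, 75, 75)        # reduce sensor noise
    edges = cv2.Canny(smooth, 50, 150)                    # Canny edge detection
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    # Keep external contours whose shape is close to the expected masking symbol.
    return sum(1 for c in contours
               if cv2.matchShapes(c, template_contour, cv2.CONTOURS_MATCH_I1, 0.0) < 0.1)

def keystroke_frames(video_path, template_contour, roi):
    cap = cv2.VideoCapture(video_path)
    frames, last, idx = [], 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        n = count_symbols(frame, template_contour, roi)
        if n > last:                 # a new masking symbol appeared in this frame
            frames.append(idx)
        last, idx = n, idx + 1
    cap.release()
    return frames                    # inter-keystroke times = frame deltas / fps
```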

Our experiments show that this technique leads to fairly accurate inter-keystroke timing information. We observed an average discrepancy of 8.7 ms (stdev of 26.6 ms) between the inter-keystroke timings extracted from the video and the ground truth recorded by the ATM simulator. Furthermore, 75% of inter-keystroke timings extracted by the software had errors under 10 ms, and 97% had errors under 20 ms. Similar statistics hold for data recorded on keyboards in the password setting. Figure 6 shows the distribution of these errors.

5 Password Guessing Using SILK-TV

SILK-TV treats identifying digraphs from keystroke timings as a multi-class classification problem, where each class represents one digraph, and input to the classifier is a set of inter-keystroke times. Without loss of generality, in this section, we assume that the user’s password is a sequence of lowercase alphanumeric characters typed on a keyboard with a standard layout.

To reconstruct passwords, we compared two classifiers: Random Forest (RF) [13] and Neural Networks (NN) [19]. RF is a well-known classification technique that performs well for authentication based on keystroke timings [6]. The input to RF is one inter-keystroke timing, and its output is a list of N digraphs ranked by the probability that they correspond to the input timing. NN is a more complex architecture designed to automatically determine and extract complex features from the input distribution. In our experiments, the input to NN is the list of inter-keystroke timings corresponding to a password. This enables NN to extract features such as arbitrary n-grams, or timings corresponding to non-consecutive characters. NN’s output is a guess for the entire password.

We instantiated NN using the following parameters (see the code sketch after the list):

  • number of units in the hidden layer – 128 (with ReLU activation functions);

  • inclusion probability of the dropout layer – 0.2;

  • number of input neurons – 25;

  • number of output neurons – 25, representing one character in one-hot encoding; the output layer uses a softmax activation function;

  • training was performed with a batch size of 40 for 100 epochs, using the Adam optimizer with a learning rate of 0.001.
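The following sketch instantiates a network with these parameters. The framework (Keras/TensorFlow) and the training arrays `X_train`/`y_train` are our assumptions; the paper does not name a framework, and its exact input/output encoding may differ.

```python
# Minimal sketch of the NN described above (assumption: Keras/TensorFlow).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(25,)),               # 25 input neurons (inter-key timings)
    layers.Dense(128, activation="relu"),    # hidden layer with 128 units, ReLU
    layers.Dropout(0.2),                     # dropout layer with rate 0.2 (as listed)
    layers.Dense(25, activation="softmax"),  # 25-way softmax output (one-hot character)
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")

# Hypothetical training data: X_train holds timing vectors, y_train one-hot labels.
# model.fit(X_train, y_train, batch_size=40, epochs=100)
```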

Classifier Training. We trained SILK-TV on three public datasets [5, 18, 26] that contain keystroke timing information collected from English free-text. Using these datasets for training, we modeled an attack that relies exclusively on population data. Without loss of generality, we filtered the datasets to remove all timings that do not correspond to digraphs composed of lowercase alphanumeric characters. This is motivated by the datasets’ limited availability of digraph samples that contain special characters. In practice, the adversary could collect these timings using, for instance, crowdsourcing tools such as Amazon Mechanical Turk. To account for the uneven frequencies of different digraphs, we under-sampled the most frequent digraphs in the dataset. Because the public datasets were gathered from volunteers typing free text, digraphs that are frequent in English are over-represented relative to rarer ones. For example, considering lamondre, digraph re appears 43,606 times in the population dataset, while am – only 6,481. Similarly, in 123brian, digraph ri occurs 19,782 times, while 3b – only 138. We therefore under-sampled each digraph appearing more than 1,000 times to 1,000 randomly selected occurrences. Similarly, we excluded infrequent digraphs that appeared under 100 times in the whole dataset.
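A minimal sketch of this balancing step is shown below; the input format (a list of (digraph, timing) pairs extracted from the public datasets) is assumed for illustration.

```python
# Sketch: cap over-represented digraphs at 1,000 samples and drop digraphs
# with fewer than 100 occurrences, as described above.
import random
from collections import defaultdict

def balance_digraphs(samples, cap=1000, min_count=100, seed=0):
    random.seed(seed)
    by_digraph = defaultdict(list)
    for digraph, timing in samples:
        by_digraph[digraph].append(timing)

    balanced = []
    for digraph, timings in by_digraph.items():
        if len(timings) < min_count:
            continue                               # too rare: exclude entirely
        if len(timings) > cap:
            timings = random.sample(timings, cap)  # under-sample frequent digraphs
        balanced.extend((digraph, t) for t in timings)
    return balanced
```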

Attack Process. To infer the user’s secret from inter-keystroke timings, SILK-TV leverages a dictionary of passwords (e.g., a list of passwords leaked by online services [1, 2, 10, 24]), possibly expanded using techniques such as probabilistic context-free grammars [29] and generative adversarial networks [12]. When evaluating SILK-TV, we assume that the user’s secret is in the dictionary. In practice, this is often the case, as many users use the same weak passwords (e.g., only 36% of the passwords in RockYou are unique [15]), and reuse them across many different services [11, 28]. Given that the size of a reasonable password dictionary is on the order of billions of entries, the goal of SILK-TV is to narrow down the possible passwords to a small(er) list, e.g., to perform online attacks. This list is then ranked by the probability associated with each entry, computed from inter-keystroke timing data.

Specifically:

  1. Using RF, for each inter-key time extracted from the video (corresponding to one digraph), SILK-TV returns a list of N possible guesses, sorted by the classifier’s confidence. Next, SILK-TV ranks the passwords in the dictionary by the resulting probabilities as follows: for each password, SILK-TV locates the password’s first digraph in the ranked list of predictions for the first inter-key time, and assigns that position as a “penalty” to the password. By repeating this step for each digraph, SILK-TV obtains a total penalty score for each password, i.e., a score that indicates the probability of the password given the output of the RF (see the code sketch after this list).

     For example, to rank the password jillie02, SILK-TV first considers the digraph ji and the RF’s list of predictions for the first digraph. It notes that ji appears in this list as the X-th most probable; therefore, it assigns X as the penalty for jillie02. Then, it considers il, which appears in Y-th position in the list of predictions for the second digraph. The penalty for jillie02 is thus updated to \(X+Y\). This operation is repeated for all 7 digraphs, yielding the final penalty score.

  2. Using NN, SILK-TV computes a list of N possible guesses, sorted by the classifier’s confidence in each guess. In this case, SILK-TV processes the entire list of flight times at once, rather than refining its guess with each digraph.
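The RF-based penalty ranking of step 1 can be sketched as follows. Here `rf_ranked_digraphs` is a hypothetical helper that returns the classifier’s ranked digraph predictions for the i-th inter-key time of the observed entry; it stands in for the trained RF, which is not shown.

```python
# Sketch: penalty score of a dictionary password given per-position digraph rankings.
def penalty(password, rf_ranked_digraphs):
    digraphs = [password[i:i+2] for i in range(len(password) - 1)]
    score = 0
    for i, dg in enumerate(digraphs):
        ranking = rf_ranked_digraphs(i)          # ranked guesses for the i-th digraph
        # Penalty = rank of the true digraph in the prediction list
        # (worst-case penalty if the digraph is not in the list).
        score += ranking.index(dg) if dg in ranking else len(ranking)
    return score

# Dictionary passwords are then tried in order of ascending penalty:
# candidates = sorted(dictionary, key=lambda p: penalty(p, rf_ranked_digraphs))
```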

We considered the following attack settings: single-shot, and multiple recordings. With the former, the adversary trains SILK-TV with inter-keystroke timings from population data, i.e., from users other than the target, e.g., from publicly available datasets, or by recruiting users and asking them to type passwords. In this scenario, the adversary has access to the video recording of a single password entry session. With multiple recordings, the adversary trains SILK-TV as before, and additionally, has access to videos of multiple login instances by the same user.

Training SILK-TV exclusively with population data leads to a more realistic attack scenario than training it with user-specific data, because the adversary usually has limited access to keystroke samples from the target user. Conversely, access to user-specific data would likely improve the success rate of SILK-TV.

5.1 Results

In this section, we report on SILK-TV’s efficacy in reducing the password search space on the RockYou [1] dataset, compared to frequency-weighted random guessing. We restricted experiments to the subset of 8-character passwords from RockYou, since the adversary can always determine password length by counting the number of masking symbols shown on the screen. This resulted in 6,514,177 passwords, of which 2,967,116 were unique.

Attack Baseline. To establish the attack baseline, we consider an adversary that outputs password guesses from a leaked dataset in descending order of frequency. (Ties are broken using random selection from the candidate passwords.) Because password probabilities are far from uniform (e.g., in RockYou, the top 200 8-character passwords account for over 10% of the entire dataset), this is the best adversarial strategy given no additional information on the target user.
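A minimal sketch of this baseline follows; it assumes the leaked dataset is available as a Counter of password frequencies (e.g., built from RockYou).

```python
# Sketch: baseline attack that guesses passwords in descending frequency order,
# breaking ties at random.
import random
from collections import Counter

def baseline_guesses(counts: Counter, seed=0):
    rng = random.Random(seed)
    passwords = list(counts)
    rng.shuffle(passwords)                              # random tie-breaking
    return sorted(passwords, key=lambda p: -counts[p])  # most frequent first (stable sort)

# The expected number of attempts for a target password is its 1-based position
# in this list, averaged over the tie-breaking randomness.
```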

The passwords selected for our evaluation represent a mix of common and rare passwords. Thus, they have widely varying frequencies of occurrence in RockYou, and the expected number of attempts needed to guess each password using the baseline attack varies accordingly. For example, the expected number of attempts is:

  • 123brian (appears 6 times) – 93,874;

  • jillie02 (appears only once) – 1,753,571;

  • lamondre (appears twice) – 397,213;

  • william1 (appears 1,164 times) – only 187.

Single-shot. Results in the single-shot setting are summarized in Table 1. The Cumulative Distribution Function (CDF) of successfully recovered passwords is shown in Fig. 7, and a breakdown of results by target password is shown in Fig. 8.

Fig. 7. CDF of the number of passwords recovered by SILK-TV in the Population Data attack scenario.

Table 1. SILK-TV – Single-shot setting. Avg: average number of attempts to guess a password; Stdev: standard deviation; Rnd: number of guesses for the baseline adversary; <Rnd: how often SILK-TV outperforms random guessing; Best: number of attempts of the best guess; <n: how many passwords are successfully guessed within the first \(n=\) 20,000/100,000 attempts.
Fig. 8. CDF of the number of passwords recovered by SILK-TV, for each target password. Plots also show the baseline attack for the corresponding password.

Results show that, for uncommon passwords (jillie02 and lamondre), SILK-TV consistently outperforms random guessing. In particular, for jillie02 both RF and NN greatly exceed random guessing, since both of their curves in Fig. 8 lie above the random-guessing baseline. For lamondre, RF shows an advantage over random guessing in 76% of the instances, while NN never beats the baseline.

For common passwords, the sorted random-guessing baseline outperforms SILK-TV. In particular, 123brian is both popular (the 93,874-th most popular password of the set, corresponding to the top 3% of the RockYou dataset) and very hard to recover with SILK-TV. This can be observed from Fig. 8, where the curves corresponding to 123brian are the least steep. Finally, william1, being the 187-th most popular password, is always recovered early by our baseline attack, with the notable exception of one instance by RF.

In general, SILK-TV beats the sorted random-guessing baseline on infrequent passwords, such as jillie02 and lamondre, which appear only once and twice, respectively. Such infrequent passwords exhibit nearly the same random-guessing baseline curve and average, as reported in Table 1 and shown in Fig. 8. Given the similar steepness of the CDF curves in Fig. 8, which hints that SILK-TV’s performance would be similar for many other passwords, we expect SILK-TV to outperform the baseline for uncommon passwords in general. We also note that uncommon passwords represent the vast majority of user-chosen passwords: 90% of RockYou passwords appear at most twice, and 80% exactly once. We expect that a realistic adversary would first generate password guesses based on their frequency alone (as in our baseline attack), and then switch to SILK-TV once these frequencies drop below some threshold.

Finally, we highlight that the random-guessing baseline is computed on the distribution of passwords in RockYou. Other datasets might have different distributions: for example, in the 10 million password list dataset [7], jillie02, lamondre, and 123brian appear only once, while william1 appears 176 times.

We believe that the discrepancy in SILK-TV’s performance across passwords is due to how frequently the digraphs in each password appear in the training data. Specifically, even with our under-sampling, all digraphs in william1, with the exception of m1, are far more frequent in the training data than 12, 23, 3b, or 02.

Regarding specific classifiers, RF outperforms NN in most instances. For example, when guessing 123brian (Fig. 8a), NN performs worse than random guessing for the first 800,000 attempts; afterwards, NN outperforms both random guessing and RF. Furthermore, while RF can guess a substantial percentage of passwords within 20,000, 50,000 and 100,000 attempts, NN cannot achieve the same result.

In terms of minimum number of guesses per password, RF recovered william1 in 68, lamondre in 145, 123brian in 5,535, and jillie02 in 28,962 attempts. NN required a consistently higher minimum number of attempts for each password.

Multiple Recordings. Information from multiple login instances was used as follows: we averaged the classifiers’ predictions over the three login instances of a given user, and ranked passwords accordingly.
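One plausible realization of this combination, for the RF path, is to average each password’s penalty score across recordings before ranking; the sketch below assumes that aggregation (the exact averaging of predictions is not spelled out above) and that `scores_per_recording` is a list of per-recording {password: penalty} dictionaries produced as in Sect. 5.

```python
# Sketch: combine multiple recordings by averaging per-recording penalty scores.
def combined_ranking(scores_per_recording):
    passwords = scores_per_recording[0].keys()
    n = len(scores_per_recording)
    avg = {p: sum(scores[p] for scores in scores_per_recording) / n
           for p in passwords}
    return sorted(passwords, key=lambda p: avg[p])  # lowest average penalty first
```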

Table 2. SILK-TV – Multiple recordings setting. Avg: average number of attempts to guess a password; Stdev: standard deviation; Rnd: number of guesses for the baseline adversary; <Rnd: how often SILK-TV outperforms random guessing; Best: number of attempts of the best guess; <n: how many passwords are successfully guessed within the first \(n=\) 20,000/100,000 attempts.
Fig. 9. CDF showing the number of passwords recovered by SILK-TV in the Multiple recordings scenario.

Results are summarized in Table 2 and Fig. 9. Although SILK-TV still consistently outperforms random guessing, using data from multiple authentication recordings leads to results that are largely identical to the single-shot setting, for both RF and NN. SILK-TV’s guessing success rate for 123brian and jillie02 improves slightly compared to the single-shot setting, and the minimum number of attempts to recover each password also changes: we recovered william1 in 19, lamondre in 404, 123brian in 13,931, and jillie02 in 67,875 attempts. Overall, these results show no substantial benefit in using timing data from three recordings of the same user.

6 PIN Guessing Using SILK-TV

We now discuss PIN-related results, specifically, relationships between: (1) inter-keystroke timings and Euclidean Distance between consecutive keys, and (2) inter-keystroke timings and direction of movement on the keypad.

We are not aware of any publicly available PIN timing datasets that could be used to train SILK-TV. To address this issue, we divided our dataset into two parts: the first was used as training data, and the second as testing data. To compute the attack baseline, we considered all PINs to be equally likely.

Distance. Across all subjects, we observed that the distributions of inter-keystroke latencies were distinct in all cases (p-value \(<5\cdot 10^{-6}\)), with the following exceptions: (1) latencies for distance 2 (e.g., keypair 1–3) were close to latencies for distance 3 (e.g., keypair 2–0); (2) latencies for distance 2 were close to latencies for the 1\(\,\times \,\)1 diagonal (e.g., keypair 4–8); and (3) latencies for distance 3 were close to latencies for the 2\(\,\times \,\)1 diagonal (e.g., “2” to “9”, “1” to “6”), the 2\(\,\times \,\)2 diagonal (e.g., keypair 7–3), and the 3\(\,\times \,\)2 diagonal (e.g., keypair 3–0). Figure 10a shows the various probability distributions, while Fig. 10b models these probability distribution functions as gamma distributions. In Fig. 10a, dist_zero indicates keypairs composed of the same two digits. dist_one, dist_two, and dist_three show timing distributions for keypairs with horizontal or vertical distance one (e.g., keypair 2–5), two (e.g., 2–8), and three (e.g., 2–0), respectively. dist_diagonal_one and dist_diagonal_two indicate keypairs with diagonal distance one (e.g., 2–4) and two (e.g., 1–9), respectively. dist_dogleg and dist_long_dogleg show timing distributions of keypairs such as 1–8 and 0–3, respectively. In Fig. 10b, dist_one_horizontal and dist_one_vertical indicate Euclidean Distance one in the left/right and up/down directions, respectively, while dist_one_up, dist_one_down, dist_one_left, and dist_one_right indicate distance one in the up, down, left, and right directions, respectively.

Fig. 10. Inter-keystroke timings of all possible distances for ATM keypad typing.

Direction. The relative orientation of key pairs characterized by the same Euclidean distance (e.g., 2–3 vs. 2–5) has a negligible impact on the corresponding inter-key latency. We observed that the distributions of keypress latencies for the different possible directions between keys were not significantly different (p-value \(<10^{-4}\)). Figure 11 shows the probability distributions for the various directions at Euclidean distance 1.

Fig. 11. Frequency of inter-keystroke timings for Euclidean Distance of one. dist_one indicates the latency distribution for distance one in any direction.

Fig. 12. CDF showing the number of PINs recovered by SILK-TV, compared to the baseline.

6.1 PIN Inference

Using the data we collected, we mapped the distributions of inter-keypress latencies, and used the resulting probabilities to test the effectiveness of PIN prediction from inter-key latencies.

To guess PINs from inter-key latencies, we used data from 14 users to model the inter-key latencies as gamma distributions, and then tested on the data from the remaining users. Figure 12 shows the effectiveness of these predictions compared to brute-force guessing. Due to the lack of separation between the distributions of most distances and directions, the improvement over brute force is small (in the −1% to 4% range), leading to an average reduction in guessing attempts of about 3.8%.
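The following sketch illustrates this procedure with SciPy: fit one gamma distribution per distance/direction feature on the training users, then rank all 10,000 candidate PINs by the likelihood of the observed inter-key times. The data layout (`train_times` mapping a feature such as "dist_one" to its training latencies) and the helper `feature_of` are assumptions for illustration.

```python
# Sketch: gamma-based PIN ranking from observed inter-key latencies.
from itertools import product
from scipy.stats import gamma

def fit_models(train_times):
    # One fitted gamma (shape, loc, scale) per distance/direction feature.
    return {f: gamma.fit(t) for f, t in train_times.items()}

def pin_log_likelihood(pin, observed_times, models, feature_of):
    ll = 0.0
    for (a, b), t in zip(zip(pin, pin[1:]), observed_times):
        shape, loc, scale = models[feature_of(a, b)]
        ll += gamma.logpdf(t, shape, loc=loc, scale=scale)
    return ll

def rank_pins(observed_times, models, feature_of):
    pins = ["".join(p) for p in product("0123456789", repeat=4)]
    # Most likely PIN (given the three observed inter-key times) first.
    return sorted(pins, key=lambda p: -pin_log_likelihood(p, observed_times, models, feature_of))
```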

7 Conclusion

In this paper, we have shown that the inter-key timing information disclosed by the appearance of password masking symbols can be effectively used to reduce the cost of password guessing attacks. To determine the impact of this side channel, we recorded videos of 84 subjects typing several passwords and PINs under different conditions: in a lecture hall, with their laptop connected to a projector; in a classroom setting; and using a simulated ATM. Our results show that: (1) it is possible to infer very accurate timing information from videos of LCD screens and projectors (the average error was 8.7 ms, which is comparable to the duration of a frame when the display refresh rate is set to 60 Hz); (2) inter-keystroke timings speed up password guessing by 25% to 385% compared to the baseline, with some passwords guessed within 19 attempts – we consider this a substantial reduction in the cost of password guessing attacks, to the point that we believe masking symbols should not be publicly displayed while typing passwords; and (3) disclosing inter-keystroke timings has a relatively small impact on PIN guessing attacks (the average reduction in the number of attempts required to guess a 4-digit PIN was 3.8%).

Clearly, the benefits of SILK-TV compared to our baseline attack vary depending on how common the user’s password is. For very common (and therefore very easy to guess) passwords, our results show that SILK-TV might not be needed. On the other hand, the speedup offered by SILK-TV when guessing rare passwords is substantial. Given the effectiveness of this attack on password guessing, we believe that future work should consider countermeasures that strike the right balance between usability and security when displaying masking symbols. For instance, GUIs could avoid displaying masking symbols on secondary screens (e.g., projectors), or could display new masking symbols at fixed intervals (say, every 250 ms). Clearly, both countermeasures have usability implications, and we leave the quantification of this impact to future work.