1 Introduction

The widespread use of automated systems together with cloud-based services gives end users broad scope for storing and accessing data efficiently. However, it also poses a major challenge for security and authentication. Before access to secured data is granted, it is essential to verify the authenticity of the user; determining whether the user is entitled to the data is the foremost goal of authentication. Most advanced systems in different application areas work with distributed workstations (servers) deployed over different geographic regions. The security of users and their data becomes more vulnerable over a wireless medium, because no dedicated link or method is specified there. A reliable safeguard against unauthorized access to computer resources and data is therefore needed.

Traditional authentication techniques mostly depended on password-based methods, which fail to provide enough protection for user data. This has prompted researchers to turn to biometrics, which includes fingerprints, palm veins, face recognition, DNA, palm prints, hand geometry, iris recognition, and patterns of human behavior such as key-typing rhythm. Keystroke dynamics [1, 6], or typing dynamics, is a behavioral biometric: it refers to the automated method of identifying or confirming the identity of an individual based on the manner and rhythm of typing on a keyboard. Keystroke techniques are of two types: keystroke static authentication (KSA) and keystroke dynamic authentication (KDA). In the static technique, user authentication is performed at a particular time instance. The continuous (dynamic) method is more effective than KSA, and it requires verification to continue during the entire session of user interaction. The raw measurements used for keystroke dynamics are dwell time and flight time.

The rest of the paper is organized as follows. In Sect. 2, we briefly review various approaches to keystroke biometrics and their error rates. In Sect. 3, our proposed approach is described. We give full details of the implementation and provide experimental results in Sect. 4. Finally, Sect. 5 concludes the paper with suggestions for future work (Fig. 1).

Fig. 1. Classification of biometric authentication

2 Literature Review

Most existing approaches focus on static verification, where a user types a specific pre-enrolled string, e.g., a password during login, and the keystroke features are then analyzed for authentication [1]. Pin et al. [2] proposed a two-layer fusion approach that strengthens existing password-based authentication and reported an equal error rate (EER) of 1.401 %. Using classification techniques based on template matching and Bayesian likelihood models, Fabian Monrose [3] achieved an accuracy of 83.22–92.14 %. Yu et al. [4] recommended a nearest-neighbor classifier with a new distance metric to identify a legitimate user with respect to a threshold value; this system achieved an EER of 8.7 %. Kenneth Revett et al. [5] achieved 95 % accuracy in user authentication with a software-based module in which the combination of typing speed and the first and last few characters of the login ID is enough to identify an authentic user. Wang et al. [7] introduced a user authentication approach based on keystroke dynamics that comprises a training stage and an authentication stage; it showed better performance in terms of false acceptance rate (FAR) and false rejection rate (FRR). Babaeizadeh [8] suggested a KDA-based system for verifying a user who requests services via a CSP in a mobile cloud computing (MCC) environment. The proposed ECC cryptographic algorithm, combined with the keystroke duration attribute, was reported to defend against 97.33 % of imposter attack attempts. The data quality, uniqueness and consistency of typing patterns can be improved by using artificial rhythms and tempo cues [9].

3 The Proposed System

Biometric authentication systems usually have two phases: an enrolment phase and an authentication phase. In the enrolment phase, user data is gathered, processed and stored in a database; this becomes the template for the later authentication phase. In the authentication phase, the user data is acquired and processed, and a matching process checks the authenticity of the user against his/her pre-stored reference templates.

Our fundamental objective is to generate a unique signature for each individual by analyzing his/her typing behavior. The proposed system captures user data on a continuous basis and uses the concept of free text (i.e., the user does not have to type any dedicated text in order to create his/her profile). In brief, the characteristics of the proposed model are:

  1. Keystrokes based continuous authentication.

  2. Dynamic (all text editor based) data collection.

  3. Unique signature vector for each user.

The proposed logic has three sub-phases for identifying a user's unique behavior: data collection, preprocessing of stored data, and signature vector generation.

Our proposed system, depicted in Fig. 2, focuses on capturing the unique typing behavior of each individual.

Fig. 2. Block diagram of the proposed system.

Here is a brief description of each sub-phase:

  • Data Acquisition: Raw keystroke data of individuals are collected via various input devices, such as a normal computer keyboard, a customized pressure-sensitive keyboard, or a virtual keyboard [10, 11, 12]. The output of this phase is a text file of an individual's typing behavior with key dwell time and key hold time.

  • Data Preparation: Pre-processing procedures such as feature selection, dimension reduction, and outlier detection [13] are applied to the collected samples prior to feature extraction to ensure or improve the quality of the feature data. A substantial number of data samples are collected for each individual.

  • Signature Vector Generator: The output of the data preparation phase is the input to this phase. It is used to generate a unique signature for each individual by applying rules to the identified features; the signatures are stored in the database for future classification.

3.1 Data Acquisition

For the purpose of this work we designed a routine to collect user data (key-typing behavior). The routine collects the events generated by individuals (operators of computer systems) while using a keyboard. At present, the system works on the MS-Windows platform and does not require any additional libraries. It runs continuously in the background and records the user's keyboard activity. The events are captured on the fly and saved, per user, as text records [user_id, v_i] in a database. A sample of collected input data is presented below.

Input data collection is carried out separately for each user user_id. Each key event can be represented as a 5-tuple. In a session, the ith key-press event is represented by the vector v_i as follows:

$$ {\text{v}}_{\text{i}} = \left\{ {Session\_ID,{\text{ key}}\_{\text{name}}_{\text{i}} ,{\text{ hold}}\_{\text{time}}_{\text{i}} ,{\text{ dwell}}\_{\text{time}}_{\text{i}} ,{\text{ sys}}\_{\text{time}}_{\text{i}} } \right\} $$

where key_name_i is the name of the key in the ith key-press event of the session Session_ID, named according to the standard QWERTY keyboard layout; hold_time_i is the timestamp difference between the press and the release of the ith key; dwell_time_i is the timestamp difference between the release of the (i−1)th key and the press of the ith key; and sys_time_i is the system time, in hours and minutes, at which the event occurs.

V is the composite vector {v_1, v_2, …, v_n}, where n is the total number of key presses in a session on a single day. In practice we restrict the number of samples collected from each user; hence our database is a collection SV = {V_1, V_2, …, V_m}, where m is the number of samples collected for each uid.
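The key-event definitions above can be sketched in Python as follows. This is a minimal illustration, not the authors' collection routine: the `KeyEvent` class, the `build_events` helper, the millisecond unit, and the sample log are all assumptions introduced here. The first key of a session has no predecessor, so its dwell time is left undefined.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyEvent:
    """One key event v_i, mirroring the 5-tuple defined in the text."""
    session_id: int
    key_name: str
    hold_time: float             # release_i - press_i (assumed unit: ms)
    dwell_time: Optional[float]  # press_i - release_(i-1); None for the first key
    sys_time: str                # wall-clock time of the event as "HH:MM"


def build_events(session_id: int,
                 raw: List[Tuple[str, float, float, str]]) -> List[KeyEvent]:
    """Convert raw (key_name, press_ms, release_ms, sys_time) records,
    ordered by press time, into the composite vector V of one session."""
    events: List[KeyEvent] = []
    prev_release: Optional[float] = None
    for key_name, press_ms, release_ms, sys_time in raw:
        hold = release_ms - press_ms
        dwell = None if prev_release is None else press_ms - prev_release
        events.append(KeyEvent(session_id, key_name, hold, dwell, sys_time))
        prev_release = release_ms
    return events


# Hypothetical raw log for one session (timestamps in milliseconds).
raw_log = [
    ("H", 0.0, 95.0, "10:15"),
    ("E", 180.0, 260.0, "10:15"),
    ("Y", 300.0, 410.0, "10:15"),
]
events = build_events(1, raw_log)
```

Under this definition the dwell time becomes negative when a key is pressed before the previous one is released, which a real collector would have to decide how to handle.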

Additionally, we store the total number of BACKSPACE key presses in each session during which the user interacts with his/her machine. The samples collected for the BACKSPACE key can be described by a vector TB = {TB_1, TB_2, …, TB_L}, where L is the number of sessions on a single day and TB_j = {session_j, backspace_count_j};

where backspace_count_j is the total number of times the BACKSPACE key is pressed in session_j. We then compute the average number of BACKSPACE key presses on a single day and store it in the database with day_id. The average number of BACKSPACE key presses (AB) on the kth day is calculated as follows:

$$ {\text{AB}}_{k} = \frac{1}{L}\sum\nolimits_{j = 1}^{L} {\text{backspace\_count}}_{j} $$

where L is the total number of sessions on the kth day.

The values AB_k constitute vectors A_i = {uid, day_i, AB_i}, each describing the average number of BACKSPACE key presses on the ith day by the user uid (Table 1).

Table 1. Key_event_recoder
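As a small worked illustration of the daily BACKSPACE average, the sketch below computes AB for each day from the per-session counts and assembles the vectors {uid, day, AB}. The function name `average_backspace`, the dictionary layout, and the sample counts are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple


def average_backspace(tb: List[Tuple[int, int]]) -> float:
    """Return AB for one day, given TB = [(session_j, backspace_count_j), ...]."""
    if not tb:
        raise ValueError("at least one session is required")
    return sum(count for _, count in tb) / len(tb)


# Hypothetical data: day_id -> TB vector for that day.
daily_tb: Dict[int, List[Tuple[int, int]]] = {
    1: [(1, 12), (2, 8), (3, 10)],
    2: [(1, 5), (2, 7)],
}

uid = "USER1"  # placeholder user identifier
# One vector {uid, day, AB} per recorded day.
a_vectors = [(uid, day, average_backspace(tb))
             for day, tb in sorted(daily_tb.items())]
```

For the sample data, day 1 has three sessions with 30 presses in total, so its average is 10, and day 2 has two sessions with 12 presses, so its average is 6.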

3.2 Data Preparation

In this phase we select the features used to generate the individual signature: key_hold_time and key_dwell_time. We aim to derive a specific range for each key event for these two features.

Our database stores the collected samples as a vector SV = {V_1, V_2, …, V_m}, where m is the total number of sessions for each uid on a particular day. Preprocessing is performed on each V_i = {v_1, v_2, …, v_n}, where n is the number of keys pressed in the ith session.

We sort the key-press events in a session and measure the maximum and minimum hold time of each key event k. We store this range and update it as further sessions are processed. Finally, for each key event k on day d we obtain a range for the user u_id, stored in the database as a vector K_H {day_id, key_event_k, max_hold_time_k, min_hold_time_k}, where max_hold_time_k and min_hold_time_k define the range of the key hold time for the kth key event on day day_id.
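The range update for key hold time can be sketched as below. This is an illustrative reading of the step, assuming each session is available as a list of (key_name, hold_time) pairs; `update_hold_ranges`, `to_k_h`, and the sample values are names and data invented for the example.

```python
from typing import Dict, Iterable, List, Tuple

# key_event -> (min_hold_time, max_hold_time)
HoldRange = Dict[str, Tuple[float, float]]


def update_hold_ranges(ranges: HoldRange,
                       session: Iterable[Tuple[str, float]]) -> HoldRange:
    """Widen the per-key range with (key_name, hold_time) samples of one session."""
    for key, hold in session:
        lo, hi = ranges.get(key, (hold, hold))
        ranges[key] = (min(lo, hold), max(hi, hold))
    return ranges


def to_k_h(day_id: int, ranges: HoldRange) -> List[Tuple[int, str, float, float]]:
    """Emit rows (day_id, key_event, max_hold_time, min_hold_time), sorted by key."""
    return [(day_id, key, hi, lo) for key, (lo, hi) in sorted(ranges.items())]


# Hypothetical sessions of one user on one day (hold times in ms).
sessions = [
    [("A", 90.0), ("B", 110.0), ("A", 120.0)],
    [("A", 80.0), ("B", 105.0)],
]
ranges: HoldRange = {}
for session in sessions:
    update_hold_ranges(ranges, session)
k_h = to_k_h(1, ranges)
```

Processing sessions one at a time keeps only two numbers per key in memory, which matches the "store the range and update on later sessions" description.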

For the key_dwell_time feature, we pair adjacent keys (k, k + 1) and store the pair-wise dwell time. In each session we select the same pairs and list all their dwell_time values. In this way a range for every observed key pair is obtained for a day and stored per user as a vector K_D {day_id, key_pair_j, max_dwell_time_j, min_dwell_time_j} (Tables 2 and 3).

Table 2. Key_Hold_Time
Table 3. Key_dwell_Time
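The pairing of adjacent keys for the dwell-time feature might look like the following sketch. It assumes each session is a list of (key_name, dwell_time) entries in typing order, with the dwell time of an entry measured from the release of the preceding key; the helper names and sample values are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

KeyPair = Tuple[str, str]
Session = List[Tuple[str, Optional[float]]]


def pair_dwell_times(session: Session) -> List[Tuple[KeyPair, float]]:
    """Pair each key with its predecessor; the pair's dwell time is the value
    recorded on the second key (release of k to press of k + 1)."""
    pairs = []
    for (prev_key, _), (key, dwell) in zip(session, session[1:]):
        if dwell is not None:
            pairs.append(((prev_key, key), dwell))
    return pairs


def build_k_d(day_id: int,
              sessions: List[Session]) -> List[Tuple[int, KeyPair, float, float]]:
    """Emit rows (day_id, key_pair, max_dwell_time, min_dwell_time)."""
    ranges: Dict[KeyPair, Tuple[float, float]] = {}
    for session in sessions:
        for pair, dwell in pair_dwell_times(session):
            lo, hi = ranges.get(pair, (dwell, dwell))
            ranges[pair] = (min(lo, dwell), max(hi, dwell))
    return [(day_id, pair, hi, lo) for pair, (lo, hi) in sorted(ranges.items())]


# Hypothetical sessions: the word "THE" typed twice (dwell times in ms).
sessions = [
    [("T", None), ("H", 60.0), ("E", 45.0)],
    [("T", None), ("H", 75.0), ("E", 40.0)],
]
k_d = build_k_d(1, sessions)
```

Only pairs that actually occur in the collected text receive a range, so the size of the result grows with the variety of the free text typed.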

3.3 Signature Vector Generator

To generate a template for an individual uid, we construct a unique signature vector for that individual. Our feature space has three attributes (features): key_hold_time (kh), key_dwell_time (kd) and Backspace_key_count (bkc). For template creation we consider the first two features.

After preprocessing, the input data is stored in our repository in the form of the K_H and K_D vectors, from which we generate a signature vector U_V for each user:

$$ {\text{U}}\_{\text{V}} = \left\{ {{\text{u}}\_{\text{id}},{\text{ Avg}}\_{\text{hold}}\_{\text{time}},{\text{ Avg}}\_{\text{dwell}}\_{\text{time}}} \right\} $$

Avg_hold_time is derived from max_hold_time_k, min_hold_time_k \( \in \) K_H for \( \forall k \in Key \), where Key comprises all key events produced by the user over the entire sample collection period. Similarly, max_dwell_time_j, min_dwell_time_j \( \in \) K_D are used to obtain Avg_dwell_time for \( \forall \) kp \( \in Key\_Pair \).
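The text does not state the exact rule by which the averages are derived from the stored ranges. The sketch below therefore makes an explicit assumption: each range is represented by its midpoint, (max + min) / 2, and the midpoints are averaged over all keys or key pairs. Both this rule and the names `range_midpoint_average` and `signature_vector` are illustrative choices, not the authors' stated method.

```python
from statistics import mean
from typing import List, Tuple


def range_midpoint_average(rows: List[Tuple[float, float]]) -> float:
    """ASSUMPTION: average = mean of (max + min) / 2 over all rows.
    Each row is (max_time, min_time) for one key or key pair."""
    return mean((hi + lo) / 2 for hi, lo in rows)


def signature_vector(u_id: str,
                     k_h: List[Tuple[float, float]],
                     k_d: List[Tuple[float, float]]) -> Tuple[str, float, float]:
    """Build (u_id, Avg_hold_time, Avg_dwell_time) from the stored ranges."""
    return (u_id, range_midpoint_average(k_h), range_midpoint_average(k_d))


# Hypothetical (max, min) ranges in milliseconds.
k_h = [(120.0, 80.0), (110.0, 90.0)]  # hold-time ranges of two keys
k_d = [(75.0, 65.0), (50.0, 30.0)]    # dwell-time ranges of two key pairs
u_v = signature_vector("USER1", k_h, k_d)
```

Other derivations, such as weighting each key by how often it was typed, would fit the same interface by replacing `range_midpoint_average`.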

4 Experimental Results

We collected data sets from 10 participants over 10 days. The sample data set collected for each individual, per session on a day, is shown in Table 5. The users were asked to run our application in the background during the entire period of interaction with their dedicated machines. The samples were collected from different machines with different configurations.

Table 4. Signature_vector_generator

The users were not required to type any dedicated text string, and there is no additional interface for capturing data. All active windows accessed by the users were taken into consideration when generating the sample data set.

The samples collected for each user on a particular day were then sorted in alphabetical order of key events. The processed samples are shown in Tables 6 and 7 for the hold time and dwell time features, respectively (Tables 5, 6, 7 and 8).

Table 5. Sample collected for an individual on a session.
Table 6. Processed data for key hold time feature per user
Table 7. Processed data-set for key dwell time feature per user
Table 8. Signature vector set for a particular user in the enrolled data-set.

We differentiate user behavior based on the two features discussed so far, i.e., hold time and dwell time. Tables 6 and 7 illustrate the comparative analysis of two users, USER1 and USER2, with respect to the key hold time and key dwell time features (Figs. 3 and 4).

Fig. 3. Average key holding time for all possible keys

Fig. 4. Average dwell time for a key pair

5 Conclusion

We have observed that prevalent biometrics-based techniques for identifying a legitimate user often suffer from high FAR and FRR, which has a negative effect on their accuracy. The study reveals that most developed applications require the user to type a dedicated text (mainly passwords of a specific format). However, fixed-text samples fail to capture significant variations in individual typing because of the limited set of characters used. In this paper we have used the free-text concept to address this issue. The software for collecting user data is designed to be machine independent, and samples were collected from a varying set of computers. Our proposed signature vectors cover all possible key events, so that the aggregated behavior of the end user is stored in the repository. Our future work will concentrate on the classification and verification of individuals based on these stored templates.