Elsevier

Pervasive and Mobile Computing

Volume 35, February 2017, Pages 83-107

Anomaly detection for smartphone data streams

https://doi.org/10.1016/j.pmcj.2016.07.006

Abstract

Smartphones centralize a great deal of users’ private information and are thus a primary target for cyber-attack. The main goal of the attacker is to access and exfiltrate the private information stored in the smartphone without detection. In situations where explicit information is lacking, these attackers can still be detected in an automated way by analyzing data streams (continuously sampled information such as an application’s CPU consumption, accelerometer readings, etc.). Once a data stream has been clustered, anomaly detection techniques can be applied to it in order to detect attacks in progress. In this paper we utilize an algorithm called pcStream that is well suited for detecting clusters in real world data streams and propose extensions to the pcStream algorithm designed to detect point, contextual, and collective anomalies. We provide a comprehensive evaluation that addresses mobile security issues on a unique dataset collected from 30 volunteers over eight months. Our evaluations show that the pcStream extensions can be used to effectively detect data leakage (point anomalies) and malicious activities (contextual anomalies) associated with malicious applications. Moreover, the algorithm can be used to detect when a device is being used by an unauthorized user (collective anomaly) within approximately 30 s, with one false positive every two days.

Introduction

In 2016, over two billion people will have a smartphone as a part of their daily lives [1]. Smartphones provide a means of communication, as well as a central location to store and organize information, a quality which makes these popular devices enticing targets for attackers interested in stealing private information [2]. One way to protect a smartphone is to perform anomaly detection [3]. In this approach, the normal behavior of an internal or external actor is modeled so that malicious activities can be detected as anomalies, that is, behaviors which do not fit the norm. Subsequently the detected malicious activity can be blocked, thereby protecting the user. For example, take the malicious behavior of sending SMSs to premium numbers (monetary theft). In this case, explicit information such as the message’s textual content or the destination number could be used to directly determine whether the SMS is anomalous. However, in some cases explicit information is unavailable or insufficient, making it a challenge to detect the SMS as malicious. This is the case, for instance, if the SMS contains legitimate text (stolen from the user’s outbox), or if there is no complete list of premium numbers to blacklist.

In cases where explicit information is lacking, contextual information is often available in the form of a data stream. Contextual information is the additional information that assists in clarifying a particular event or behavior [4]. In this paper, we refer to data streams as unbounded sequences of measurements sampled continuously from a particular source. Modern smartphones are equipped with a wide range of sources which can be sampled in order to generate a data stream rich in contextual information. For instance, Android smartphones allow all applications to receive information about the phone’s other applications (including statistics related to the CPU, memory usage, and system priorities) through the Linux virtual /proc/ folder [5]. Furthermore, information on device motion and device status is available as well. When sampled, these data streams can be used to capture the actor’s contexts in order to detect possible anomalies. Returning to our example, let us assume that the device’s motion sensors are sampled when SMSs are sent. In this scenario, by utilizing the contextual information from the sensor’s data stream, it becomes easier to differentiate between a physical human (external actor) sending an SMS and some automated malicious code (internal actor) which is sending the SMS.

Here are additional examples of situations in which explicit information for the detection of anomalies is lacking but contextual data streams are available:

  • Outbound encrypted transmissions: malware (such as a bot) typically sends data back to its command and control server (C&C) over an encrypted channel. Therefore, semantic analysis or other explicit information about the data in motion is not obtainable. However, the context of the encrypted transmission, along with other details about transmissions, captures information useful in determining a transmission’s legitimacy [6].

  • Activities of applications in the Android OS platform: Applications running in Android are sandboxed in separate Dalvik virtual machines (DVM) [7]. At a basic level, this prevents applications from accessing other applications’ data and resources without explicit user permissions [8]. In order to gain full access, a device must be rooted, giving all applications access to privileged commands within Android’s subsystems, a security risk in itself. Since devices are shipped unrooted by default, antivirus applications available through the Android marketplace are highly limited in the dynamic analysis (online scanning) they can perform in order to detect anomalies. However, even without root privileges, applications can obtain contextual information by sampling other applications’ statistics (e.g., CPU utilization, memory usage, etc.).

  • High level inference from low level trust-zones: Understanding what applications are doing from a low level trust zone (e.g., hypervisor) is difficult, because kernel (as well as DVM) information is not directly accessible. For instance, it is not clear what application is sending data or which process is currently in the foreground legitimately using the CPU. However, the motion data and screen on/off data are available from the hypervisor. Using this data, it may be possible to detect illegitimate transmissions as they occur.
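The sampling of per-application statistics mentioned above can be illustrated with a minimal sketch. It assumes the platform permits reading another process's `/proc/<pid>/stat` entry and that the entry follows the standard Linux field layout, in which `utime` and `stime` are the 14th and 15th fields. The function names `parse_proc_stat` and `cpu_ticks_delta` are hypothetical helpers introduced for illustration, not part of the paper.

```python
def parse_proc_stat(line):
    """Extract CPU tick counters from one line in /proc/<pid>/stat format."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is field 3 (state), so utime (field 14) and stime (field 15)
    # sit at indices 11 and 12 of the remainder.
    return {
        "pid": int(pid_str),
        "comm": comm,
        "utime": int(fields[11]),
        "stime": int(fields[12]),
    }


def cpu_ticks_delta(prev, curr):
    """CPU ticks (user + system) consumed between two samples of a process."""
    return (curr["utime"] + curr["stime"]) - (prev["utime"] + prev["stime"])


# Two hypothetical samples of the same process, taken some time apart.
sample_a = "1234 (my app) S 1 1234 1234 0 -1 4194304 100 0 0 0 250 75 0 0 20 0 5 0 12345"
sample_b = "1234 (my app) S 1 1234 1234 0 -1 4194304 130 0 0 0 290 85 0 0 20 0 5 0 12345"

first = parse_proc_stat(sample_a)
second = parse_proc_stat(sample_b)
delta = cpu_ticks_delta(first, second)
```

Sampling such deltas at a fixed interval yields one numeric feature of a data stream; memory usage and scheduling priority could be appended to the same observation vector in the same way.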

In order to utilize the contextual information in a data stream, one needs to mine the stream for the hidden contexts. The hidden contexts found in a real world data stream can exhibit behaviors in the form of correlated distributions (clusters) [9]. These clusters of observations are referred to in the literature as the concepts or contexts captured by the data [9], [10], [11], [12]. By modeling these contexts, it is possible to detect anomalies in an unsupervised manner [13], [14]. However, the detection of anomalies in data streams is a challenging process, because data streams are unbounded in length and involve recurring concepts as well as concept drifts [15]. These properties make it difficult to distinguish between a previously seen concept which is now changing and a new concept (an anomaly) which has not been seen before. Current stream clustering algorithms can detect and track concept drifts [16], [17]. However, (1) they were not specifically designed to detect the various types of anomalies found in data streams, and (2) they are not able to distinguish between clusters which overlap in geometric space. They cannot differentiate between overlapping clusters because they seek to form geometric partitions of the feature space, and therefore do not respect the ground truth. To illustrate this issue, Fig. 1 plots the temporal concepts found while performing activity recognition using a smartphone’s accelerometer. Here any partition of the feature space into three clusters will not respect the ground truth (that the clusters formed from each activity overlap in geometric space).

In this paper we propose a solution for the detection of various types of anomalies in data streams which have overlapping clusters. In a previous work we proposed pcStream: a stream clustering algorithm used to dynamically detect and manage temporal contexts [18]. The name “pcStream” is based on the principal components of the distributions in the data stream, which are used to dynamically detect and compare the underlying contexts. One of the advantages of pcStream is that it can detect and model overlapping concepts. Detection and management are accomplished by taking into account the temporal relation of the stream’s observations (i.e., temporal contexts). This is the main reason for pcStream’s ability to outperform state-of-the-art stream clustering algorithms in detecting contexts in real world data streams (the reader is invited to view our original paper for the analysis [18]). Moreover, pcStream tracks concept drifts, keeping the captured contexts relevant and up to date.
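To make the idea of temporal contexts concrete, the following is a simplified sketch and not the published pcStream implementation. It substitutes a per-feature standardized distance for the principal-component-based comparison, and it assumes a mechanism in which observations that fit no known context are buffered until enough consecutive ones accumulate to form a new context. The class names, the `memory` parameter, and the buffering details are illustrative assumptions; `phi` and `t_min` are named after the two parameters mentioned in the Discussion.

```python
import math
from collections import deque


class Context:
    """One temporal context: a bounded memory of recent observations."""

    def __init__(self, points, memory):
        self.points = deque(points, maxlen=memory)

    def distance(self, x):
        """Standardized Euclidean distance from x to this context."""
        n = len(self.points)
        total = 0.0
        for j, xj in enumerate(x):
            column = [p[j] for p in self.points]
            mean = sum(column) / n
            var = sum((v - mean) ** 2 for v in column) / n
            total += (xj - mean) ** 2 / (var + 1e-9)
        return math.sqrt(total)


class ContextTracker:
    """Assigns each observation to a context, creating new ones on demand."""

    def __init__(self, phi, t_min, memory=200):
        self.phi = phi          # distance beyond which x fits no context
        self.t_min = t_min      # consecutive misfits needed for a new context
        self.memory = memory    # assumed cap on observations kept per context
        self.contexts = []
        self.drift = []         # consecutive observations that fit no context
        self.current = None

    def observe(self, x):
        """Process one observation and return the current context index."""
        dists = [c.distance(x) for c in self.contexts]
        best = min(range(len(dists)), key=dists.__getitem__) if dists else None
        if best is None or dists[best] > self.phi:
            self.drift.append(x)
            if len(self.drift) >= self.t_min:
                # A sustained departure: treat the buffer as a new context.
                self.contexts.append(Context(self.drift, self.memory))
                self.drift = []
                self.current = len(self.contexts) - 1
        else:
            # A short excursion: fold the buffered points into the match.
            self.contexts[best].points.extend(self.drift)
            self.contexts[best].points.append(x)
            self.drift = []
            self.current = best
        return self.current
```

Because the bounded memory discards old observations as new ones arrive, each context's statistics follow gradual change in the stream, which is one simple way to approximate the tracking of concept drift described above.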

In this work, we propose three extensions to the pcStream algorithm, each designed to detect a different type of anomaly found in data streams generated by smartphones. The general approach is to train pcStream on a data stream which captures the normal contexts (including those which overlap). We can then use the trained model to detect point, contextual, or collective anomalies using the respective extension. The concept for deploying these extensions is to have a security agent running on the device. By sampling the relevant numeric features (e.g., an application’s CPU usage), the agent observes the actor for unexpected behaviors which it can either block or inform the user about, thereby providing implicit smartphone security.

Therefore the contributions of this paper are as follows:

  • (1) Introduction of a single algorithm, pcStream, for detecting point, contextual, and collective anomalies found in temporal data streams, in particular, data streams whose concepts overlap each other in geometric space (i.e., feature space). We make the source code for this algorithm available online, with versions written in R, Matlab, Python, and PySpark (for Hadoop clusters).

  • (2) Evaluation of pcStream in the realm of smartphone security. We explore examples of how pcStream can be used as a security solution for addressing current smartphone security threats, specifically, the detection of data leakage, the detection of active malware in dynamic analysis, and the provision of continuous user authentication, all achieved by analyzing data streams.

To evaluate pcStream as an anomaly detection tool, we collected a dataset consisting of eight months of sensor data from 30 volunteers, all of whom used Samsung Galaxy S5 smartphones. In order to detect short-term anomalies (such as a malicious transmission or a device theft), we sampled the devices’ sensors at a high temporal resolution. Other existing datasets do not have this level of resolution for a long period of time. This is most likely because many sensors have significant power consumption. To overcome this challenge, we provided the volunteers with battery cases that nearly triple the battery life of the device (ensuring a total battery life of 9–10 h on a full charge).

The remainder of this paper is structured as follows. Section 2 reviews the pcStream algorithm. Section 3 presents the different types of anomalies addressed in this paper and proposes the anomaly detection extensions to the pcStream algorithm. Section 4 presents the dataset used, the evaluation setup, and the evaluation results. Section 5 provides a discussion of related issues. Section 6 reviews related work. Finally, Section 7 provides a conclusion and proposes future work.

Section snippets

Review of the pcStream algorithm

In this section, we briefly review the basic pcStream algorithm as presented in [18].

pcStream anomaly detection extensions

In general, there are three types of anomalies: point anomalies, contextual anomalies, and collective anomalies [13]. In this section, we present three extensions to pcStream which enable it to detect the three anomaly types. In Section 4, we evaluate these added capabilities as implicit smartphone security solutions. Fig. 2 is an illustration referenced throughout this section that visualizes how pcStream views each type of anomaly.
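The distinction between the three anomaly types can be illustrated with a small sketch. It is a generic illustration, not the extensions proposed in this paper. It assumes that a trained model yields, for each observation, a list of distances to the known contexts, and that each past observation has been labeled with a context index. The function names, the threshold `phi`, the `expected` context argument, and the use of a first-order transition model with a `min_avg_prob` cutoff are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def point_anomaly(distances, phi):
    """True if the observation is farther than phi from every known context."""
    return min(distances) > phi


def contextual_anomaly(distances, expected, phi):
    """True if the observation fits some context, but not the expected one."""
    return min(distances) <= phi and distances[expected] > phi


def fit_transitions(context_sequence):
    """Estimate first-order transition probabilities between context labels."""
    counts = defaultdict(Counter)
    for a, b in zip(context_sequence, context_sequence[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(row.values()) for b, n in row.items()}
        for a, row in counts.items()
    }


def collective_anomaly(window, transitions, min_avg_prob):
    """True if the window's transitions are, on average, improbable.

    The window must contain at least two context labels.
    """
    probs = [
        transitions.get(a, {}).get(b, 0.0)
        for a, b in zip(window, window[1:])
    ]
    return sum(probs) / len(probs) < min_avg_prob


# Hypothetical training sequence of context labels for normal behavior.
normal = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
transitions = fit_transitions(normal)
```

In the collective case each observation in the window may look normal on its own; only the pattern of context changes, such as rapid alternation between two contexts that normally persist, gives the anomaly away.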

Evaluation

In this section, we evaluate pcStream as an implicit security solution for the smartphone using the extensions described in Section 3. Specifically, we test whether it is possible to detect data leakage (point anomalies), active malware (contextual anomalies), and unauthorized users via continuous authentication (collective anomalies).

As a comparative baseline in the evaluation of pcStream, we evaluate two other stream clustering algorithms: DBStream [26] and D-Stream [27]. These algorithms

Discussion

Parameter selection. Like many machine learning algorithms, pcStream has multiple parameters, namely t_min and ϕ. Finding the optimum parameters can be challenging, especially since stream datasets can be very long and therefore involve a lengthy training time. In this paper we performed a non-exhaustive grid search over these parameters, where each point in the grid is a set of parameters. For larger problems, one may consider using a hyper-parameter selection technique which performs a minimal

Related literature

As mentioned, in this study we aim to detect anomalies when explicit information on the application, device, or user behavior is unavailable. In such cases, we opt to analyze sensor data to extract the latent contexts which can be used to detect anomalies.

With the evolution of smartphones, the importance of implementing context-aware security for such devices has been acknowledged. One example is the CrePE system [45], which introduced context-based policy enforcement on applications.

Conclusion

In mobile security, there are situations where it is not possible to explicitly determine the legitimacy of various actors (internal and external). However, in many of these situations, implicit information in the form of data streams can be collected and used for detecting anomalies. Therefore, in this paper we proposed extensions to the pcStream algorithm, enabling it to detect point, contextual, and collective anomalies in contextual data streams. To evaluate the algorithm’s capability

Acknowledgment

This research was funded by the Israeli Ministry of Science, Space and Technology.

References (53)

  • M.B. Harries et al., Extracting hidden context, Mach. Learn. (1998)
  • G. Widmer et al., Learning in the presence of concept drift and hidden contexts, Mach. Learn. (1996)
  • G. Widmer, Tracking context changes through meta-learning, Mach. Learn. (1997)
  • J.a.B. Gomes et al., Calds: Context-aware learning from data streams
  • V. Chandola et al., Anomaly detection: A survey, ACM Comput. Surv. (CSUR) (2009)
  • A. Tsymbal, The problem of concept drift: definitions and related work, Computer Science Department, Trinity College...
  • C.C. Aggarwal, A survey of stream clustering algorithms,...
  • J.A. Silva et al., Data stream clustering: A survey, ACM Comput. Surv. (2013)
  • Y. Mirsky et al., pcStream: A stream clustering algorithm for dynamically detecting and managing temporal contexts
  • A. Padovitz, S.W. Loke, A. Zaslavsky, Towards a theory of context spaces, in: Pervasive Computing and Communications...
  • J. Shlens, A tutorial on principal component analysis, arXiv Preprint...
  • B. Babcock et al., Maintaining variance and k-medians over data stream windows
  • S. Wold, M. Sjostrom, Simca: a method for analyzing chemical data in terms of similarity and analogy,...
  • X. Song et al., Conditional anomaly detection, IEEE Trans. Knowl. Data Eng. (2007)
  • Pausing and resuming an activity, http://developer.android.com/training/basics/activity-lifecycle/pausing.html...
  • M. Dunham, Y. Meng, J. Huang, Extensible Markov model, in: Data Mining, 2004. ICDM’04. Fourth IEEE International...