Anomaly detection for smartphone data streams

doi:10.1016/j.pmcj.2016.07.006

Pervasive and Mobile Computing

Volume 35, February 2017, Pages 83-107

https://doi.org/10.1016/j.pmcj.2016.07.006

Abstract

Smartphones centralize a great deal of users’ private information and are thus a primary target for cyber-attack. The main goal of the attacker is to access and exfiltrate the private information stored in the smartphone without detection. In situations where explicit information is lacking, these attackers can still be detected in an automated way by analyzing data streams (continuously sampled information such as an application’s CPU consumption, accelerometer readings, etc.). Once a data stream has been clustered, anomaly detection techniques may be applied to it in order to detect attacks in progress. In this paper we utilize an algorithm called pcStream that is well suited for detecting clusters in real world data streams and propose extensions to the pcStream algorithm designed to detect point, contextual, and collective anomalies. We provide a comprehensive evaluation that addresses mobile security issues on a unique dataset collected from 30 volunteers over eight months. Our evaluations show that the pcStream extensions can be used to effectively detect data leakage (point anomalies) and malicious activities (contextual anomalies) associated with malicious applications. Moreover, the algorithm can be used to detect when a device is being used by an unauthorized user (collective anomaly) within approximately 30 s with one false positive every two days.

Introduction

In 2016, over two billion people will have a smartphone as a part of their daily lives [1]. Smartphones provide a means of communication, as well as a central location to store and organize information, a quality which makes the popular devices enticing targets for attackers interested in stealing private information [2]. One way to protect a smartphone is to perform anomaly detection [3]. In this approach, the normal behavior of an internal or external actor is modeled so that malicious activities can be detected as anomalies, that is, behaviors which do not fit the norm. Subsequently, the detected malicious activity can be blocked, thereby protecting the user. For example, take the malicious behavior of sending SMSs to premium numbers (monetary theft). In this case, explicit information such as the message’s textual content or the destination number could be used to directly determine whether the SMS is anomalous. However, in some cases explicit information is unavailable or insufficient, making it a challenge to detect the SMS as malicious, for instance when the SMS contains legitimate text (stolen from the user’s outbox) or when there is no complete list of premium numbers to blacklist.

In cases where explicit information is lacking, contextual information is often available in the form of data streams. Contextual information is the additional information that assists in clarifying a particular event or behavior [4]. In this paper, we refer to data streams as unbounded sequences of measurements sampled continuously from a particular source. Modern smartphones are equipped with a wide range of sources which can be sampled in order to generate a data stream rich in contextual information. For instance, Android smartphones allow all applications to receive information about the phone’s other applications (including statistics related to the CPU, memory usage, and system priorities) through the Linux virtual /proc/ folder [5]. Furthermore, information on device motion and device status is available as well. When sampled, these data streams can be used to capture the actor’s contexts in order to detect possible anomalies. Returning to our example, let us assume that the device’s motion sensors are sampled when SMSs are sent. In this scenario, by utilizing the contextual information from the sensor’s data stream, it becomes easier to differentiate between a physical human (external actor) sending an SMS and some automated malicious code (internal actor) which is sending the SMS.
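To make the idea of sampling application statistics concrete, the following is a minimal sketch (not the collection agent used in this paper) of how a single line in the format of the Linux `/proc/<pid>/stat` file could be turned into a numeric observation. The function name `parse_proc_stat`, the choice of features, and the sample line are assumptions made for illustration; the field positions follow the layout documented in the Linux proc manual page.

```python
def parse_proc_stat(line):
    """Extract a few numeric features from one /proc/<pid>/stat line.

    Illustrative helper (hypothetical name); the feature selection is an
    assumption, not the feature set used in the paper.
    """
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ')' instead of on whitespace.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    rest = tail.split()  # rest[0] is field 3 (state), so field k is rest[k - 3]
    return {
        "pid": int(pid_text),
        "comm": comm,
        "utime_ticks": int(rest[11]),   # field 14: CPU time in user mode
        "stime_ticks": int(rest[12]),   # field 15: CPU time in kernel mode
        "priority": int(rest[15]),      # field 18
        "nice": int(rest[16]),          # field 19
        "num_threads": int(rest[17]),   # field 20
        "vsize_bytes": int(rest[20]),   # field 23: virtual memory size
        "rss_pages": int(rest[21]),     # field 24: resident set size
    }


# A made-up sample line (truncated after field 24) for demonstration only.
SAMPLE = ("1234 (com.example app) S 1 1234 1234 0 -1 4194560 500 0 2 0 "
          "150 30 0 0 20 0 12 0 1000 104857600 2560")

observation = parse_proc_stat(SAMPLE)
print(observation["utime_ticks"], observation["num_threads"])
```

Calling such a parser at a fixed interval and appending each result to a sequence would yield one possible data stream of the kind described above.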

Here are additional examples of situations in which explicit information for the detection of anomalies is lacking but contextual data streams are available:

• Outbound encrypted transmissions: Malware (such as a bot) typically sends data back to its command and control (C&C) server over an encrypted channel. Therefore, semantic analysis or other explicit information about the data in motion is not obtainable. However, the context of the encrypted transmission, along with other details about the transmission, captures information useful in determining the transmission’s legitimacy [6].
• Activities of applications in the Android OS platform: Applications running in Android are sandboxed in separate Dalvik virtual machines (DVM) [7]. At a basic level, this prevents applications from accessing other applications’ data and resources without explicit user permissions [8]. In order to gain full access, a device must be rooted, giving all applications access to privileged commands within Android’s subsystems, a security risk in itself. Since devices are shipped unrooted by default, antivirus applications available through the Android marketplace are highly limited in the dynamic analysis (online scanning) they can perform in order to detect anomalies. However, without root privileges applications can obtain contextual information by sampling other applications’ statistics (e.g., CPU utilization, memory usage, etc.).
• High level inference from low level trust-zones: Understanding what applications are doing from a low level trust zone (e.g., hypervisor) is difficult, because kernel (as well as DVM) information is not directly accessible. For instance, it is not clear which application is sending data or which process is currently in the foreground legitimately using the CPU. However, the motion data and screen on/off data are available from the hypervisor. Using this data, it may be possible to detect illegitimate transmissions as they occur.

In order to utilize the contextual information in a data stream, one needs to mine the stream for the hidden contexts. The hidden contexts found in a real world data stream can exhibit behaviors in the form of correlated distributions (clusters) [9]. These clusters of observations are referred to in the literature as the concepts or contexts captured by the data [9], [10], [11], [12]. By modeling these contexts, it is possible to detect anomalies in an unsupervised manner [13], [14]. However, the detection of anomalies in data streams is a challenging process, because data streams are unbounded in length and involve recurring concepts as well as concept drifts [15]. These properties make it difficult to distinguish between a previously seen concept which is now changing and a new concept (an anomaly) which has not been seen before. Current stream clustering algorithms can detect and track concept drifts [16], [17]. However, (1) they were not specifically designed to detect various types of anomalies found in data streams, and (2) they are not able to distinguish between clusters which overlap in geometric space. The reason they cannot differentiate between overlapping clusters is that these algorithms seek to form geometric partitions of the feature space, and therefore do not respect the ground truth. To illustrate this issue, Fig. 1 plots the temporal concepts found while performing activity recognition using a smartphone’s accelerometer. Here, any partition of the feature space into three clusters will not respect the ground truth (that the clusters formed from each activity overlap in geometric space).

In this paper we propose a solution for the detection of various types of anomalies in data streams which have overlapping clusters. In a previous work we proposed pcStream: a stream clustering algorithm used to dynamically detect and manage temporal contexts [18]. The name “pcStream” is based on the principal components of the distributions in the data stream which are used to dynamically detect and compare the underlying contexts. One of the advantages of pcStream is that it can detect and model overlapping concepts. Detection and management is accomplished by taking into account the temporal relation of the stream’s observations (i.e., temporal contexts). This is the main reason for pcStream’s ability to outperform state-of-the-art stream clustering algorithms in detecting contexts in real world data streams (the reader is invited to view our original paper for the analysis [18]). Moreover, pcStream tracks concept drifts, keeping the captured contexts relevant and up to date.
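The role of principal components in telling overlapping contexts apart can be illustrated with a small sketch. pcStream's own model and scoring are defined in [18] and are not reproduced here; the code below substitutes the plain Mahalanobis distance, which measures how far a point lies from a distribution along its principal axes in units of standard deviation, and restricts itself to 2-D points. The function names and the two toy contexts are assumptions made for illustration.

```python
import math


def fit_context(points):
    """Return the mean and sample covariance terms of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(x, model):
    """Mahalanobis distance of point x from a fitted 2-D context model."""
    (mx, my), (sxx, sxy, syy) = model
    dx, dy = x[0] - mx, x[1] - my
    det = sxx * syy - sxy * sxy  # determinant of the 2x2 covariance matrix
    # Quadratic form d^T * inverse(covariance) * d, written out for 2x2.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


# Two toy contexts that share the same centre (they overlap geometrically)
# but are stretched along different axes.
context_a = fit_context([(-4, 0.2), (-2, -0.2), (0, 0.2),
                         (2, -0.2), (4, 0.2), (0, -0.2)])
context_b = fit_context([(0.2, -4), (-0.2, -2), (0.2, 0),
                         (-0.2, 2), (0.2, 4), (-0.2, 0)])


def nearest_context(x):
    """Name of the toy context with the smaller Mahalanobis distance to x."""
    da, db = mahalanobis(x, context_a), mahalanobis(x, context_b)
    return "A" if da < db else "B"


print(nearest_context((3, 0)), nearest_context((0, 3)))
```

Because both toy contexts have the same centroid, a centroid-based partition cannot separate them, whereas a distance that accounts for each distribution's shape can.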

In this work, we propose three extensions to the pcStream algorithm. Each extension is designed to detect a different type of anomaly found in data streams generated by smartphones. The general approach is to train pcStream on a data stream which captures the normal contexts (including those which overlap). We can then use the trained model to detect point, contextual, or collective anomalies using the respective extension. The concept for deploying these extensions is to have a security agent running on the device. By sampling the relevant numeric features (e.g., an application’s CPU usage), the agent observes the actor for unexpected behaviors which it can either block or inform the user about, thereby providing implicit smartphone security.

Therefore the contributions of this paper are as follows:

(1) Introduction of a single algorithm, pcStream, for detecting point, contextual, and collective anomalies found in temporal data streams, in particular, data streams whose concepts overlap each other in geometric space (i.e., feature space). We make the source code for this algorithm available online, with versions written in R, Matlab, Python, and PySpark (for Hadoop clusters).¹
(2) Evaluation of pcStream in the realm of smartphone security. We explore examples of how pcStream can be used as a security solution for addressing current smartphone security threats, specifically, the detection of data leakage, the detection of active malware in dynamic analysis, and the provision of continuous user authentication, all achieved by analyzing data streams.

To evaluate pcStream as an anomaly detection tool, we collected a dataset consisting of eight months of sensor data from 31 volunteers, all of whom used Samsung Galaxy S5 smartphones. In order to detect short-term anomalies (such as a malicious transmission or a device theft), we sampled the devices’ sensors at a high temporal resolution. Other existing datasets do not have this level of resolution over a long period of time, most likely because many sensors have significant power consumption. To overcome this challenge, we provided the volunteers with battery cases that nearly triple the battery life of the device (ensuring a total battery life of 9–10 h on a full charge).

The remainder of this paper is structured as follows. Section 2 reviews the pcStream algorithm. Section 3 presents the different types of anomalies addressed in this paper, and proposes the anomaly detection extensions to the pcStream algorithm. Section 4 presents the dataset used, evaluation setup, and evaluation results. Section 5 provides a discussion on related issues. Section 6 reviews related work. Finally, Section 7 provides a conclusion, and proposes future work.

Section snippets

Review of the pcStream algorithm

In this section, we briefly review the basic pcStream algorithm as presented in [18].

pcStream anomaly detection extensions

In general, there are three types of anomalies: point anomalies, contextual anomalies, and collective anomalies [13]. In this section, we present three extensions to pcStream which enable it to detect the three anomaly types. In Section 4, we evaluate these added capabilities as implicit smartphone security solutions. Fig. 2 is an illustration referenced throughout this section that visualizes how pcStream views each type of anomaly.
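The distinction between the three anomaly types can be illustrated with a toy sketch. This is not the set of pcStream extensions proposed in this section; it only follows the general definitions in [13], using hypothetical one-dimensional contexts (mean and standard deviation of a CPU usage percentage), a z-score test, and a first-order Markov chain over context labels. All names, thresholds, and probabilities are assumptions made for illustration.

```python
import math

# Hypothetical contexts: name -> (mean CPU %, standard deviation).
CONTEXTS = {"idle": (2.0, 1.0), "gaming": (60.0, 8.0)}


def z_score(x, context):
    """Distance of x from a context's mean, in standard deviations."""
    mean, std = CONTEXTS[context]
    return abs(x - mean) / std


def is_point_anomaly(x, limit=3.0):
    """True if x fits none of the known contexts."""
    return all(z_score(x, c) > limit for c in CONTEXTS)


def is_contextual_anomaly(x, expected_context, limit=3.0):
    """True if x fits some known context, but not the one expected now."""
    return (not is_point_anomaly(x, limit)
            and z_score(x, expected_context) > limit)


def is_collective_anomaly(labels, transitions, min_avg_logp=-2.0):
    """True if a sequence of context labels is improbable as a whole.

    Each label may be normal on its own; the average log-probability of
    the observed transitions decides. Unseen transitions get a small floor.
    """
    logp = sum(math.log(transitions.get((a, b), 1e-6))
               for a, b in zip(labels, labels[1:]))
    return logp / (len(labels) - 1) < min_avg_logp


# Hypothetical transition probabilities between contexts.
TRANSITIONS = {("idle", "idle"): 0.9, ("idle", "gaming"): 0.1,
               ("gaming", "gaming"): 0.9, ("gaming", "idle"): 0.1}

print(is_point_anomaly(30.0))                      # fits neither context
print(is_contextual_anomaly(58.0, "idle"))         # gaming-like load while idle is expected
print(is_collective_anomaly(
    ["idle", "gaming", "idle", "gaming", "idle"], TRANSITIONS))
```

In the last call every individual label is a known context, yet the rapid alternation is unlikely under the assumed transition probabilities, which is what makes the sequence a collective anomaly in this toy setting.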

Evaluation

In this section, we evaluate pcStream as an implicit security solution for the smartphone using the extensions described in Section 3. Specifically, we test whether it is possible to detect data leakage (point anomalies), active malware (contextual anomalies), and unauthorized users via continuous authentication (collective anomalies).

As a comparative baseline in the evaluation of pcStream, we evaluate two other stream clustering algorithms: DBStream [26] and D-Stream [27]. These algorithms

Discussion

Parameter selection. Like many machine learning algorithms, pcStream has multiple parameters, namely $t_{\min}$ and $\phi$. Finding the optimum parameters can be challenging, especially since stream datasets can be very long and therefore involve a lengthy training time. In this paper we performed a non-exhaustive grid search over these parameters, where each point in the grid is a set of parameters. For larger problems, one may consider using a hyper-parameter selection technique which performs a minimal
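A grid search over the two parameters $t_{\min}$ and $\phi$ can be sketched as follows. This is a generic illustration rather than the procedure used in this paper: the grid values, the function names, and the stand-in scoring function are all assumptions, and in practice the scoring function would train and validate a model for each parameter pair.

```python
from itertools import product


def grid_search(evaluate, t_min_values, phi_values):
    """Return (best_score, t_min, phi) over all pairs in the grid.

    `evaluate` is any callable that maps (t_min, phi) to a score where
    higher is better.
    """
    best = None
    for t_min, phi in product(t_min_values, phi_values):
        score = evaluate(t_min, phi)
        if best is None or score > best[0]:
            best = (score, t_min, phi)
    return best


def toy_score(t_min, phi):
    """Stand-in objective that peaks at t_min=20, phi=1.5 (hypothetical)."""
    return -((t_min - 20) ** 2 + (phi - 1.5) ** 2)


best_score, best_t_min, best_phi = grid_search(
    toy_score, t_min_values=[10, 20, 40], phi_values=[1.0, 1.5, 2.0])
print(best_t_min, best_phi)
```

The cost grows with the product of the grid sizes, which is why a coarse, non-exhaustive grid is a common compromise when each evaluation involves a lengthy training run.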

Related literature

As mentioned, in this study we aim to detect anomalies when explicit information on the application, device, or user behavior is unavailable. In such cases, we opt to analyze sensor data to extract the latent contexts which can be used to detect anomalies.

With the evolution of smartphones, the importance of implementing context-aware security for such devices has been acknowledged. One example is the CrePE system [45], which introduced context-based policy enforcement on applications.

Conclusion

In mobile security, there are situations where it is not possible to explicitly determine the legitimacy of various actors (internal and external). However, in many of these situations, implicit information in the form of data streams can be collected and used for detecting anomalies. Therefore, in this paper we proposed extensions to the pcStream algorithm, enabling it to detect point, contextual, and collective anomalies in contextual data streams. To evaluate the algorithm’s capability

Acknowledgment

This research was funded by the Israeli Ministry of Science, Space and Technology.

References (53)

A. Shabtai et al., Mobile malware detection through analysis of deviations in application network behavior, Comput. Secur. (2014)
Z. He et al., Mining class outliers: concepts, algorithms and applications in CRM, Expert Syst. Appl. (2004)
A. Shabtai et al., Intrusion detection for mobile devices using the knowledge-based, temporal abstraction method, J. Syst. Softw. (2010)
eMarketer, 2 billion consumers worldwide to get smart(phones) by 2016, 2015....
A.P. Felt et al., A survey of mobile malware in the wild
P. Faruki et al., Android security: a survey of issues, malware penetration, and defenses, IEEE Commun. Surv. Tutor. (2015)
A. Zimmermann et al., An operational definition of context
X. Zhou et al., Identity, location, disease and more: Inferring your secrets from Android public resources
Android developers glossary, https://developer.android.com/guide/appendix/glossary.html (accessed:...
A. Shabtai et al., Securing Android-powered mobile devices using SELinux, IEEE Secur. Privacy (2009)
M.B. Harries et al., Extracting hidden context, Mach. Learn. (1998)
G. Widmer et al., Learning in the presence of concept drift and hidden contexts, Mach. Learn. (1996)
G. Widmer, Tracking context changes through meta-learning, Mach. Learn. (1997)
J.B. Gomes et al., CALDS: Context-aware learning from data streams
V. Chandola et al., Anomaly detection: A survey, ACM Comput. Surv. (CSUR) (2009)
A. Tsymbal, The problem of concept drift: definitions and related work, Computer Science Department, Trinity College...
C.C. Aggarwal, A survey of stream clustering algorithms,...
J.A. Silva et al., Data stream clustering: A survey, ACM Comput. Surv. (2013)
Y. Mirsky et al., pcStream: A stream clustering algorithm for dynamically detecting and managing temporal contexts
A. Padovitz, S.W. Loke, A. Zaslavsky, Towards a theory of context spaces, in: Pervasive Computing and Communications...
J. Shlens, A tutorial on principal component analysis, arXiv Preprint...
B. Babcock et al., Maintaining variance and k-medians over data stream windows
S. Wold, M. Sjostrom, SIMCA: a method for analyzing chemical data in terms of similarity and analogy,...
X. Song et al., Conditional anomaly detection, IEEE Trans. Knowl. Data Eng. (2007)
Pausing and resuming an activity, http://developer.android.com/training/basics/activity-lifecycle/pausing.html...
M. Dunham, Y. Meng, J. Huang, Extensible Markov model, in: Data Mining, 2004. ICDM’04. Fourth IEEE International...

