Anomaly detection for smartphone data streams
Introduction
In 2016, over two billion people will have a smartphone as a part of their daily lives [1]. Smartphones provide a means of communication as well as a central location to store and organize information, a quality which makes these popular devices enticing targets for attackers interested in stealing private information [2]. One way to protect a smartphone is to perform anomaly detection [3]. In this approach, the normal behavior of an internal or external actor is modeled so that malicious activities can be detected as anomalies, that is, behaviors which do not fit the norm. The detected malicious activity can subsequently be blocked, thereby protecting the user. For example, take the malicious behavior of sending SMSs to premium numbers (monetary theft). In this case, explicit information such as the message’s textual content or the destination number could be used to directly determine whether the SMS is anomalous. However, in some cases explicit information is unavailable or insufficient, making it a challenge to detect the SMS as malicious, for instance, if the SMS contains legitimate text (stolen from the user’s outbox) or if there is no complete list of premium numbers to blacklist.
In cases where explicit information is lacking, contextual information is often available, frequently in the form of data streams. Contextual information is the additional information that assists in clarifying a particular event or behavior [4]. In this paper, we refer to data streams as unbounded sequences of measurements sampled continuously from a particular source. Modern smartphones are equipped with a wide range of sources which can be sampled in order to generate a data stream rich in contextual information. For instance, Android smartphones allow all applications to receive information about the phone’s other applications (including statistics related to the CPU, memory usage, and system priorities) through the Linux virtual /proc/ folder [5]. Furthermore, information on device motion and device status is available as well. When sampled, these data streams can be used to capture the actor’s contexts in order to detect possible anomalies. Returning to our example, let us assume that the device’s motion sensors are sampled when SMSs are sent. In this scenario, by utilizing the contextual information from the sensors’ data stream, it becomes easier to differentiate between a physical human (external actor) sending an SMS and automated malicious code (internal actor) sending the SMS.
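As a minimal sketch of how such per-application statistics could be turned into numeric features, the following Python snippet parses one line in the format of the Linux `/proc/<pid>/stat` file and derives a CPU utilization value from two consecutive samples. The field positions follow the Linux proc(5) manual page; the package name, the sample values, and the assumed 100 clock ticks per second are hypothetical and serve only as illustration (on a real system the tick rate can be queried with `os.sysconf("SC_CLK_TCK")`). This is not the sampling code used in the paper.

```python
def parse_proc_stat(line):
    """Parse one line in the format of /proc/<pid>/stat into a small feature dict."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split around the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is field 3 (state) of proc(5), so field k is fields[k - 3].
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "utime_ticks": int(fields[11]),  # field 14: user-mode CPU time
        "stime_ticks": int(fields[12]),  # field 15: kernel-mode CPU time
        "priority": int(fields[15]),     # field 18: scheduling priority
        "rss_pages": int(fields[21]),    # field 24: resident set size in pages
    }


def cpu_fraction(prev, cur, interval_s, ticks_per_s=100):
    """CPU utilization between two samples (ticks_per_s=100 is an assumption)."""
    busy = (cur["utime_ticks"] + cur["stime_ticks"]) - (
        prev["utime_ticks"] + prev["stime_ticks"]
    )
    return busy / (ticks_per_s * interval_s)


# Hypothetical sample line; on a device this would be read from /proc/<pid>/stat.
SAMPLE_LINE = (
    "1234 (com.example.app) S 1 1234 1234 0 -1 4194560 500 0 2 0 "
    "150 30 0 0 20 0 12 0 100000 123456789 2500"
)
sample = parse_proc_stat(SAMPLE_LINE)

# A later (hypothetical) sample in which the process used 50 more ticks.
later = dict(sample, utime_ticks=190, stime_ticks=40)
usage = cpu_fraction(sample, later, interval_s=1.0)
```

Sampling such values at a fixed interval yields one numeric observation per time step, which is the kind of data stream discussed in the remainder of this section.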
Here are additional examples of situations in which explicit information for the detection of anomalies is lacking but contextual data streams are available:
- Outbound encrypted transmissions: Typically, malware (such as a bot) sends data back to its command and control (C&C) server over an encrypted channel. Therefore, semantic analysis or other explicit information about the data in motion is not obtainable. However, the context of the encrypted transmission, along with other details about the transmission, captures information useful in determining the transmission’s legitimacy [6].
- Activities of applications in the Android OS platform: Applications running in Android are sandboxed in separate Dalvik virtual machines (DVM) [7]. At a basic level, this prevents applications from accessing other applications’ data and resources without explicit user permissions [8]. In order to gain full access, a device must be rooted, giving all applications access to privileged commands within Android’s subsystems, which is a security risk in itself. Since devices are shipped unrooted by default, antivirus applications available through the Android marketplace are highly limited in the dynamic analysis (online scanning) they can perform in order to detect anomalies. However, even without root privileges, applications can obtain contextual information by sampling other applications’ statistics (e.g., CPU utilization and memory usage).
- High-level inference from low-level trust zones: Understanding what applications are doing from a low-level trust zone (e.g., a hypervisor) is difficult, because kernel (as well as DVM) information is not directly accessible. For instance, it is not clear which application is sending data or which process is currently in the foreground legitimately using the CPU. However, the motion data and screen on/off data are available from the hypervisor. Using this data, it may be possible to detect illegitimate transmissions as they occur.
In order to utilize the contextual information in a data stream, one needs to mine the stream for the hidden contexts. The hidden contexts found in a real world data stream can exhibit behaviors in the form of correlated distributions (clusters) [9]. These clusters of observations are referred to in the literature as the concepts or contexts captured by the data [9], [10], [11], [12]. By modeling these contexts, it is possible to detect anomalies in an unsupervised manner [13], [14]. However, the detection of anomalies in data streams is a challenging process, because data streams are unbounded in length and involve recurring concepts as well as concept drifts [15]. These properties make it difficult to distinguish between a previously seen concept which is now changing and a new concept (an anomaly) which has not been seen before. Current stream clustering algorithms can detect and track concept drifts [16], [17]. However, (1) they were not specifically designed to detect the various types of anomalies found in data streams, and (2) they are not able to distinguish between clusters which overlap in geometric space. They cannot differentiate between overlapping clusters because they seek to form geometric partitions of the feature space, and therefore do not respect the ground truth. To illustrate this issue, Fig. 1 plots the temporal concepts found while performing activity recognition using a smartphone’s accelerometer. Here, any partition of the feature space into three clusters will not respect the ground truth (that the clusters formed from each activity overlap in geometric space).
In this paper we propose a solution for the detection of various types of anomalies in data streams which have overlapping clusters. In a previous work we proposed pcStream, a stream clustering algorithm used to dynamically detect and manage temporal contexts [18]. The name “pcStream” is based on the principal components of the distributions in the data stream, which are used to dynamically detect and compare the underlying contexts. One of the advantages of pcStream is that it can detect and model overlapping concepts. Detection and management are accomplished by taking into account the temporal relation of the stream’s observations (i.e., temporal contexts). This is the main reason for pcStream’s ability to outperform state-of-the-art stream clustering algorithms in detecting contexts in real world data streams (the reader is invited to view our original paper for the analysis [18]). Moreover, pcStream tracks concept drifts, keeping the captured contexts relevant and up to date.
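To make the idea of comparing an observation against contexts described by their principal components more concrete, the following Python sketch fits a two-dimensional context model (mean, principal axes, and variance along each axis) and scores a new observation by its distance in that normalized space. It is an illustration under stated assumptions and not the pcStream algorithm itself, whose procedure and parameters are given in [18]: the function names, the toy data, and the threshold of 3.0 are all hypothetical.

```python
import math


def fit_context(points):
    """Fit a 2-D context: mean, principal axes, and variance along each axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Closed-form rotation that diagonalizes the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    axes = [(c, s), (-s, c)]
    variances = [
        sxx * c * c + 2 * sxy * c * s + syy * s * s,
        sxx * s * s - 2 * sxy * c * s + syy * c * c,
    ]
    return {"mean": (mx, my), "axes": axes, "variances": variances}


def score(model, point):
    """Distance of a point from a context, in standard deviations per axis."""
    dx = point[0] - model["mean"][0]
    dy = point[1] - model["mean"][1]
    total = 0.0
    for (ax, ay), var in zip(model["axes"], model["variances"]):
        proj = dx * ax + dy * ay
        total += proj * proj / max(var, 1e-12)  # guard against zero variance
    return math.sqrt(total)


def best_context(models, point, threshold):
    """Return (name of closest context or None if none fits, its score)."""
    name, best = min(
        ((n, score(m, point)) for n, m in models.items()), key=lambda t: t[1]
    )
    return (name if best <= threshold else None), best


# Toy contexts (hypothetical data): "A" lies along a diagonal, "B" along a
# horizontal line; the two overlap geometrically near the point (5, 5).
models = {
    "A": fit_context([(0, 0), (1, 1.1), (2, 1.9), (3, 3.1), (4, 3.9), (5, 5.1)]),
    "B": fit_context([(0, 5), (1, 5.2), (2, 4.8), (3, 5.1), (4, 4.9), (5, 5.0)]),
}
THRESHOLD = 3.0  # hypothetical cut-off, not a value from the paper

fits_a = best_context(models, (2.5, 2.5), THRESHOLD)      # matches context "A"
fits_none = best_context(models, (10.0, -5.0), THRESHOLD)  # fits no context
```

In this toy example the point (5, 5) scores below the threshold under both contexts, so geometry alone cannot decide which context it belongs to; this is the kind of ambiguity that the temporal relation between consecutive observations is meant to resolve.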
In this work, we propose three extensions to the pcStream algorithm. Each extension is designed to detect a different type of anomaly found in data streams generated by smartphones. The general approach is to train pcStream on a data stream which captures the normal contexts (including those which overlap). We can then use the trained model to detect point, contextual, or collective anomalies using the respective extension. These extensions are intended to be deployed in a security agent running on the device. By sampling the relevant numeric features (e.g., an application’s CPU usage), the agent observes the actor for unexpected behaviors, which it can either block or report to the user, thereby providing implicit smartphone security.
Therefore the contributions of this paper are as follows:
- (1) Introduction of a single algorithm, pcStream, for detecting point, contextual, and collective anomalies found in temporal data streams, in particular, data streams whose concepts overlap each other in geometric space (i.e., feature space). We make the source code for this algorithm available online, with versions written in R, Matlab, Python, and PySpark (for Hadoop clusters).
- (2) Evaluation of pcStream in the realm of smartphone security. We explore examples of how pcStream can be used as a security solution for addressing current smartphone security threats, specifically, the detection of data leakage, the detection of active malware in dynamic analysis, and the provision of continuous user authentication, all achieved by analyzing data streams.
To evaluate pcStream as an anomaly detection tool, we collected a dataset consisting of eight months of sensor data from 31 volunteers, all of whom used Samsung Galaxy S5 smartphones. In order to detect short-term anomalies (such as a malicious transmission or a device theft), we sampled the devices’ sensors at a high temporal resolution. Other existing datasets do not have this level of resolution for a long period of time, most likely because many sensors have significant power consumption. To overcome this challenge, we provided the volunteers with battery cases that nearly triple the battery life of the device (ensuring a total battery life of 9–10 h on a full charge).
The remainder of this paper is structured as follows. Section 2 reviews the pcStream algorithm. Section 3 presents the different types of anomalies addressed in this paper and proposes the anomaly detection extensions to the pcStream algorithm. Section 4 presents the dataset used, the evaluation setup, and the evaluation results. Section 5 provides a discussion on related issues. Section 6 reviews related works. Finally, Section 7 provides a conclusion and proposes future work.
Section snippets
Review of the pcStream algorithm
In this section, we briefly review the basic pcStream algorithm as presented in [18].
pcStream anomaly detection extensions
In general, there are three types of anomalies: point anomalies, contextual anomalies, and collective anomalies [13]. In this section, we present three extensions to pcStream which enable it to detect the three anomaly types. In Section 4, we evaluate these added capabilities as implicit smartphone security solutions. Fig. 2 is an illustration referenced throughout this section that visualizes how pcStream views each type of anomaly.
Evaluation
In this section, we evaluate pcStream as an implicit security solution for the smartphone using the extensions described in Section 3. Specifically, we test whether it is possible to detect data leakage (point anomalies), active malware (contextual anomalies), and unauthorized users via continuous authentication (collective anomalies).
As a comparative baseline in the evaluation of pcStream, we evaluate two other stream clustering algorithms: DBStream [26] and D-Stream [27]. These algorithms
Discussion
Parameter selection. Like many machine learning algorithms, pcStream has multiple parameters, namely and . Finding the optimum parameters can be challenging, especially since stream datasets can be very long and therefore involve a lengthy training time. In this paper we performed a non-exhaustive grid search over these parameters, where each point in the grid is a set of parameters. For larger problems, one may consider using a hyper-parameter selection technique which performs a minimal
Related literature
As mentioned, in this study we aim to detect anomalies when explicit information on the application, device, or user behavior is unavailable. In such cases, we opt to analyze sensor data to extract the latent contexts which can be used to detect anomalies.
With the evolution of smartphones, the importance of implementing context-aware security for such devices has been acknowledged. One example is the CrePE system [45], which introduced context-based policy enforcement on applications.
Conclusion
In mobile security, there are situations where it is not possible to explicitly determine the legitimacy of various actors (internal and external). However, in many of these situations, implicit information in the form of data streams can be collected and used for detecting anomalies. Therefore, in this paper we proposed extensions to the pcStream algorithm, enabling it to detect point, contextual, and collective anomalies in contextual data streams. To evaluate the algorithm’s capability
Acknowledgment
This research was funded by the Israeli Ministry of Science, Space and Technology.
References (53)
- et al. Mobile malware detection through analysis of deviations in application network behavior. Comput. Secur. (2014)
- et al. Mining class outliers: concepts, algorithms and applications in crm. Expert Syst. Appl. (2004)
- et al. Intrusion detection for mobile devices using the knowledge-based, temporal abstraction method. J. Syst. Softw. (2010)
- eMarketer, 2 billion consumers worldwide to get smart(phones) by 2016, 2015....
- et al. A survey of mobile malware in the wild
- et al. Android security: a survey of issues, malware penetration, and defenses. IEEE Commun. Surv. Tutor. (2015)
- et al. An operational definition of context
- et al. Identity, location, disease and more: Inferring your secrets from android public resources
- Android developers glossary, https://developer.android.com/guide/appendix/glossary.html (accessed:...
- et al. Securing android-powered mobile devices using selinux. IEEE Secur. Privacy (2009)
- Extracting hidden context. Mach. Learn.
- Learning in the presence of concept drift and hidden contexts. Mach. Learn.
- Tracking context changes through meta-learning. Mach. Learn.
- Calds: Context-aware learning from data streams
- Anomaly detection: A survey. ACM Comput. Surv. (CSUR)
- Data stream clustering: A survey. ACM Comput. Surv.
- pcstream: A stream clustering algorithm for dynamically detecting and managing temporal contexts
- Maintaining variance and k-medians over data stream windows
- Conditional anomaly detection. IEEE Trans. Knowl. Data Eng.