Discovery of anomalous behaviour in temporal networks
Introduction
In recent years, social network analysis has been used extensively to study social behaviour. The advent of social media has made available a substantial amount of data that allows scientists from different disciplines to identify trends and principles of human social behaviour. Social network analysis (SNA) has also been used for counterterrorism purposes, in particular to identify important structures in covert networks (Valdis, 2002). For example, Valdis (2002) was able to identify the leader of the group responsible for 9/11 after the relevant data had been published. While SNA is very useful for identifying the fundamental structures of networks, it assumes that (most of) the data is available. Once a covert network has been discovered, the data becomes available, as in Valdis (2002). Crucially, during investigations only partial data is available, which may not be sufficient to warrant any form of social network analysis.
Statistical analysis in SNA typically focuses on identifying average responses from network measures. Average measures are important when the analysis is used to identify trends. For example, if the ultimate goal of the analysis is to design marketing strategies, there is a clear interest in understanding how the majority of nodes in the network behave. Similar arguments apply in other settings, such as financial services and retail banking, where outliers can safely be ignored. In the investigative setting, however, both the assumption that the available data is complete and the focus on average measures turn out to be inappropriate. Realistically, prior to any discovery, it is unlikely that investigators have all the data available; the basic assumption that all data is available is simply not realistic. Secondly, criminals are likely to want to hide their behaviour, and thus to deviate from standard behaviour. For this reason average measures are of little interest, whereas outliers can be very important. In this paper we address the problem of detecting anomalous behaviour in large networks. The application that we envisage is a situation where a great deal of data is available, but not all of it is useful to the ultimate goal of the analysis. For example, during police or financial investigations, large amounts of data are collected that should help to take the investigation further. Very often the data collected is overwhelmingly large, while only a tiny portion of it is actually useful. Some form of mechanical analysis of the data is required to extract the useful parts. The problem is similar to finding a needle in a haystack: the first and most important step is to isolate the region worth searching. In the criminal setting, a few observations help to delineate the problem further.
First of all, since criminal actions (in any field) are relatively rare, only a small percentage of the population is involved. Moreover, covert behaviour differs from ordinary, everyday behaviour: if we were able to characterise, in some precise sense, ‘normal’ or ‘average’ behaviour, then we could say that some ‘non-normal’ behaviour could be due to criminal activity. Note that we are not suggesting that all ‘non-normal’ behaviour is due to criminal activity. Anomalous behaviour indicates that the activity of individuals deviates from expected or normal behaviour. The discovery of anomalous behaviour could have a great impact in applications such as fraud detection (in financial services, tax, telecommunications, and credit cards), security investigations, epidemiology, and many others. In medical epidemiology in particular, outliers can reveal how diseases spread, especially in the early stages, and this understanding could be very helpful for prevention. For example, applying the analysis proposed in this paper to existing large data sets may suggest that, in the early stages of an epidemic, the carriers are outliers with respect to some behaviour, such as contact with animals or with a particular group of individuals.
In this paper we present a methodology for identifying anomalous behaviour in a large data set. The main idea is to perform an accurate statistical analysis (Trivedi, 2002, Harrison and Patel, 1992, William, 2004, Bernardo and Smith, 1994) to establish whether the data at hand reveals something out of the ordinary (Mary et al., 2009, Silva and Willett, 2008). Our methodology is simple, yet powerful: we consider a temporal network and analyse the distribution of a specific event between two nodes over a relatively long period of time. We consider discrete time, infer the parameters of our distribution from the data, and analyse the tail of the distribution under the null hypothesis. The null hypothesis states that the behaviour of a given individual is normal; rejecting it identifies the outliers. We make no assumptions about the behaviour of individuals in the network or about their relationships: this is because we wish to keep our method general enough to be applicable to a wide range of fields. Several interesting models have been devised to study social networks and their evolution (Snijders et al., 2010). In particular, Snijders proposes an attractive model based on Markov chains where, roughly speaking, the probability of moving from one state to another depends on a set of variables that represent the state of the social network at that particular time. Snijders’ model assumes continuous time; in our analysis we consider discrete time. Our model is minimal: it makes very few assumptions, but it is also rather crude. Continuous-time models are more accurate than discrete-time ones: they assume that the phenomenon under consideration is observed continuously, hence rates for the probabilistic events are needed in the modelling.
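To make the idea concrete, the tail test can be sketched as follows. This excerpt does not fix a particular distribution, so the Poisson null below, and the helper names poisson_tail_p and flag_outliers, are illustrative assumptions of ours: a rate is inferred from the data, and a pair whose observed event count falls in the upper tail at level alpha is flagged as an outlier.

```python
import math

def poisson_tail_p(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), via the complementary CDF."""
    # P(X >= k) = 1 - P(X <= k - 1)
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def flag_outliers(counts: dict, alpha: float = 0.01) -> list:
    """Reject the null 'this pair behaves like the population average'
    for pairs whose observed count lies in the upper tail."""
    lam = sum(counts.values()) / len(counts)  # rate inferred from the data
    return [pair for pair, k in counts.items() if poisson_tail_p(k, lam) < alpha]

# Toy example: communication counts per node pair over an observation window.
counts = {("a", "b"): 3, ("a", "c"): 2, ("b", "c"): 4, ("c", "d"): 25}
print(flag_outliers(counts))  # [('c', 'd')]
```

Only the pair whose count is far in the upper tail of the fitted distribution is flagged; the choice of alpha trades detection power against false positives.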
The rest of the paper is structured as follows: in Section 2 we define the problem we are addressing; in Section 3 we explain the statistical model; in Section 4 we show how our method applies to the VAST data set 2008 and to recently collected Twitter data; a conclusion and consideration of future work finalise the article.
The problem
Consider a scenario where, given a rather large group of people, we know that some of them are of interest to us, but we do not know which ones. We only know that some of them could be criminals preparing for some important activity. We can identify all individuals, as shown in Fig. 1. The network contains 400 nodes with 1562 directed edges. How can we identify who is behaving in a strange way? The network displayed in Fig. 1 has relatively normal features as far as networks of
Discrete time model
The model of the network is a graph 〈N, E〉 composed of a set of nodes N and a set of edges E. For each pair of nodes (i, j) of the graph, we use the random variable Ni,j(t) to represent the communications between the pair of nodes in the interval [0, t]. Further, we define a temporal view of the graph as follows:
For directed graphs, Eq. (1) defines the number of outgoing edges from node i up to time t, Eq. (2) defines the number of
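A minimal sketch of this counting, assuming communications arrive as timestamped directed events (the function names are ours, and the per-node aggregate only paraphrases the Eq. (1) style of counting described above):

```python
from collections import defaultdict

def communication_counts(events, t):
    """N_{i,j}(t): number of i -> j communications observed in [0, t]."""
    n = defaultdict(int)
    for (i, j, s) in events:  # each event is (source, target, time)
        if s <= t:
            n[(i, j)] += 1
    return dict(n)

def out_counts(events, t):
    """Aggregate in the style of Eq. (1): outgoing communications of node i up to t."""
    out = defaultdict(int)
    for (i, j), k in communication_counts(events, t).items():
        out[i] += k
    return dict(out)

events = [("a", "b", 1), ("a", "b", 2), ("a", "c", 2), ("b", "c", 5)]
print(communication_counts(events, 2))  # {('a', 'b'): 2, ('a', 'c'): 1}
print(out_counts(events, 2))            # {'a': 3}
```

Evaluating these counts on a grid of discrete time points gives the temporal view of the graph on which the distributional analysis is performed.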
VAST data set 2008
The VAST data set 2008 is a synthetic data set created specifically for the VAST challenge contest in 2008. The data contains telephone records of a group of 400 individuals over a period of ten days. Some information about the data is given: it is assumed that, among the group of people under observation, there are a few criminals. The main reason to use this data is that the challenge has been solved and
Comparison with relevant work
Several of the concepts reported in this paper have been inspired by Heard et al. (2010), who analysed network data similar to the VAST data set 2008. The main distinction between their work and ours is their use of Bayesian statistics (William, 2004, Bernardo and Smith, 1994) to infer the distribution of the communication data over a long period of time. Bayesian analysis requires an initial assumption on the probability of each trial.
In Heard et al. (2010), the number of calls made in
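As a purely illustrative contrast with our frequentist approach, a conjugate Beta-Binomial model shows the kind of prior assumption Bayesian analysis requires: a Beta(a, b) prior on the per-interval communication probability yields a closed-form posterior predictive whose upper tail measures surprise. The function below is our own sketch, not the model of Heard et al. (2010).

```python
import math

def beta_binomial_tail(k: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """P(X >= k) under the Beta-Binomial predictive with a Beta(a, b) prior
    on the probability that a communication occurs in a given interval."""
    def pmf(x):
        # Beta-Binomial pmf: C(n, x) * B(x + a, n - x + b) / B(a, b),
        # computed via log-gamma for numerical stability.
        return (math.comb(n, x)
                * math.exp(math.lgamma(x + a) + math.lgamma(n - x + b)
                           - math.lgamma(n + a + b)
                           + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)))
    return sum(pmf(x) for x in range(k, n + 1))

# With a uniform Beta(1, 1) prior the predictive is uniform over 0..n,
# so observing 9 or more active intervals out of 10 has tail probability 2/11.
print(beta_binomial_tail(9, 10))
```

The prior (a, b) is exactly the "initial assumption on the probability of each trial" mentioned above; our discrete-time method avoids having to specify it.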
Conclusion
We have presented a methodology for discovering anomalies using statistical analysis. So far we have considered the VAST data set 2008 and a rather large Twitter data set, with good results. The main problem that this analysis faces is the presence of false positives, and future research should be devoted to finding suitable ways to eliminate as many false positives as possible. To deal with false positives we envisage essentially two methods: to reduce the number of false positives or rank
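One standard device for controlling false positives, offered here only as a sketch and not as the paper's own proposal, is a multiple-testing correction applied to the per-node p-values, for example the Benjamini-Hochberg procedure, whose sorted p-values also provide a natural ranking of the outliers:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return the indices of rejected hypotheses under the Benjamini-Hochberg
    step-up procedure, which controls the false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p-value
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:  # step-up threshold for this rank
            k = rank
    return sorted(order[:k])  # reject the k smallest p-values

# Per-node tail p-values from the anomaly test; nodes 0, 2, 4 survive correction.
pvals = [0.001, 0.3, 0.02, 0.8, 0.004]
print(benjamini_hochberg(pvals))  # [0, 2, 4]
```

With hundreds of nodes tested simultaneously, a per-test threshold alone would flag many nodes by chance; controlling the false discovery rate keeps the list of candidates short enough for an investigator to inspect.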
Acknowledgment
We gratefully acknowledge Donal Simmie for having collected the Twitter data.
References (11)
- Snijders et al., Introduction to actor-based models for network dynamics, Soc. Networks (2010)
- Bernardo and Smith, Bayesian Theory (1994)
- Harrison and Patel, Performance Modelling of Communication Networks and Computer Architectures (1992)
- Heard et al., Bayesian anomaly detection methods for social networks, Ann. Appl. Stat. (2010)
- Mary et al., Anomaly detection in large graphs, CMU-CS-09-173 (2009)