Elsevier

Social Networks

Volume 41, May 2015, Pages 18-25

Discovery of anomalous behaviour in temporal networks

https://doi.org/10.1016/j.socnet.2014.12.001

Highlights

  • We propose a new statistical analysis for highlighting anomalous behavior in social networks.

  • Our method overcomes the shortcomings of standard social network analysis.

  • We show that our method can be applied to large datasets from social media such as Twitter.

Abstract

In this work we consider the problem of detecting anomalous behaviour and present a novel approach that classifies ‘behaviour’ as either normal or abnormal by checking the p-value associated with its occurrence, which is modelled with a binomial distribution within a discrete time model. We investigate the problem by looking at how communication evolves over time in a social network graph. Under the assumption that some nodes of the network can be labelled qualitatively, our approach allows us to infer a subset of nodes of the social network which might share the same qualitative connotation. In other words, assuming one or more members belong to some criminal organisation, we wish to investigate how many other persons belong to the same organisation. We have tested our method on two datasets, VAST2008 and a Twitter dataset (data collected in 2012), with encouraging results.

Introduction

In recent years, social network analysis has been used extensively to identify social behaviour. The advent of social media has provided a substantial amount of data that allows scientists from different disciplines to identify trends and principles of human social behaviour. Social network analysis (SNA) has also been used for counterterrorism purposes, in the identification of important structures in covert networks (Valdis, 2002). For example, Valdis (2002) was able to identify the leader of the group responsible for 9/11 after the publication of the available data. While SNA is very useful in the identification of fundamental structures of networks, it assumes that (most of the) data is available. Once a covert network has been discovered, the data becomes available, as in Valdis (2002). Crucially, during investigations, only partial data is available, which might not be sufficient to warrant any form of social network analysis.

Most of the time, statistical analysis in SNA focuses on identifying average responses from measures. Average measures are very important if the analysis is used to identify trends. For example, if the ultimate goal of the analysis is to identify marketing strategies, there is a clear interest in understanding how the majority of nodes in the network operate. Similar arguments apply in other settings such as financial services and retail banking. In these settings, outliers can be safely ignored. In the investigative setting, however, neither the assumption that the available data is complete nor the focus on average measures is appropriate. First, prior to any discovery, it is unlikely that investigators have all the data available, so the basic assumption that all data is available is simply not realistic. Second, criminals are likely to want to hide their behaviour, and thus will deviate from standard behaviour. For this reason average measures are of little interest, whereas outliers can be very important.

In this paper we address the problem of detecting anomalous behaviour in large networks. The application that we envisage is the situation where a lot of data is available, but not all of it is useful to the ultimate goal of the analysis. For example, during police or financial investigations, a lot of data is collected which should help in taking the investigation further. Very often the data collected is overwhelmingly large, while only a tiny portion of it is actually useful. Some form of mechanical analysis of the data is required to extract the useful parts. The problem is similar to finding a needle in a haystack: the most important part is to isolate the region of the data that is useful. In the criminal setting, we can make a few observations that help us to delineate the problem further. First of all, as criminal actions (in any field) are (relatively) rare, only a small percentage of the population is involved. Similarly, covert behaviour differs from ordinary, everyday behaviour, so if we were able to characterise ‘normal behaviour’ or ‘average behaviour’ in some precise way, we could say that some ‘non-normal behaviour’ could be due to criminal activity. Note that we are not suggesting that all ‘non-normal behaviour’ is due to criminal activities.

Anomalous behaviour suggests that the activity of individuals is deviating from expected or normal behaviour. Discovery of anomalous behaviour could potentially have a great impact in various applications such as fraud detection (in financial services, tax, telecommunication, and credit cards), security investigations, epidemiology, and many others. In particular, in medical epidemiology, outliers reveal how diseases spread, especially at the early stages. This understanding could be very helpful for prevention. For example, applying the analysis proposed in this paper to existing large data sets may suggest that, at the early stages of an epidemic, the carriers are the outliers with respect to some behaviour, such as contact with animals or with a group of individuals.

In this paper we present a methodology to identify anomalous behaviour in a large data set. The main idea consists of performing an accurate statistical analysis (Trivedi, 2002, Harrison and Patel, 1992, William, 2004, Bernardo and Smith, 1994) to establish whether the data at hand reveals something out of the ordinary (Mary et al., 2009, Silva and Willett, 2008). Our methodology is simple, yet powerful: we consider a temporal network and analyse the distribution of a specific event between two nodes over a relatively long period of time. We consider discrete time, infer the parameters of our distribution from the data, and analyse the tail of the distribution under the null hypothesis. The null hypothesis states that the behaviour of a given individual is normal, and the individuals for which it is rejected are the outliers. We make no assumption about the behaviour of individuals in the network or about their relationships: we wish to keep our methods general enough to be applicable to a wide range of fields. Several interesting models have been devised to study social networks and their evolution (Snijders et al., 2010). In particular, Snijders has an attractive model based on Markov chains, where, roughly speaking, the probability of moving from one state to another depends on a set of variables that represent the state of the social network at that particular time. Snijders’ model assumes continuous time, whereas our analysis considers discrete time. Our model is minimal: it makes very few assumptions but is also rather crude. Continuous time models are more accurate than discrete time ones: they assume that the phenomenon under consideration is observed continuously, hence rates of probabilistic events are necessary for the modelling.
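To make the tail test described above concrete, the following Python sketch computes an upper-tail binomial p-value for each node and flags the nodes for which the null hypothesis of normal behaviour is rejected. It is only an illustration under stated assumptions: the pooled estimate of the per-slot event probability, the significance level `alpha`, the toy counts, and the names `binomial_upper_tail` and `flag_anomalies` are hypothetical and are not taken from the paper.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def flag_anomalies(active_slots: dict, n_slots: int, alpha: float = 0.01) -> dict:
    """Flag nodes whose event count is improbably high under the null hypothesis.

    active_slots maps each node to the number of discrete time slots
    (out of n_slots) in which the event of interest was observed.
    """
    # Assumption: a single per-slot probability, estimated by pooling all nodes.
    p_hat = sum(active_slots.values()) / (n_slots * len(active_slots))
    p_values = {
        node: binomial_upper_tail(k, n_slots, p_hat)
        for node, k in active_slots.items()
    }
    # Reject the null hypothesis ("behaviour is normal") when the tail is small.
    return {node: pv for node, pv in p_values.items() if pv < alpha}


# Toy data (invented): events observed per node over 24 time slots.
counts = {"a": 2, "b": 3, "c": 2, "d": 3, "e": 20}
flagged = flag_anomalies(counts, n_slots=24)
print(sorted(flagged))  # ['e']
```

Pooling all nodes to estimate the probability keeps the sketch minimal, but it lets the outliers themselves inflate the estimate; a more careful version might estimate the parameter differently.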

The rest of the paper is structured as follows: in Section 2 we define the problem we are addressing; in Section 3 we explain the statistical model; in Section 4 we show how our method applies to the VAST data set 2008 and to recently collected Twitter data; a conclusion and consideration of future work finalise the article.

Section snippets

The problem

Consider a scenario where, given a rather large group of people, we know that some of them are of interest to us, but we do not know which ones. We only know that some of them could be criminals who are preparing for some important activities. We can identify all individuals, as shown in Fig. 1. The network contains 400 nodes with 1562 directed edges. How can we identify who is behaving in a strange way? The network displayed in Fig. 1 has relatively normal features as far as networks of

Discrete time model

The model of the network is a graph 〈N, E〉 composed of a set of nodes N and a set of edges E. For each pair of nodes (i, j) of the graph, we use the random variable $N_{i,j}(t)$ to represent the communication between the pair of nodes in the interval [0, t]. Further, we define a temporal view of the graph as follows:

$$N_{i\cdot}(t) = \sum_{j \neq i} N_{i,j}(t) \tag{1}$$

$$N_{\cdot j}(t) = \sum_{i \neq j} N_{i,j}(t) \tag{2}$$

$$N_{\cdot\cdot}(t) = \sum_{i} \sum_{j \neq i} N_{i,j}(t) \tag{3}$$

For directed graphs, Eq. (1) defines the number of outgoing edges from node i up to time t, Eq. (2) defines the number of
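The temporal view of the graph can be illustrated with a short Python sketch that accumulates pairwise counts and their row, column, and grand totals from a list of timestamped directed events. This is a sketch under assumptions: it treats the communication variable for a pair as a simple count of events up to time t, and the record layout `(time, i, j)`, the toy events, and the function name `marginal_counts` are hypothetical.

```python
from collections import Counter


def marginal_counts(events, t):
    """Accumulate directed communication counts for events with time <= t.

    events is an iterable of (time, i, j) records meaning "i contacted j".
    Returns (pair, out, inc, total): per-pair counts, outgoing totals per
    node, incoming totals per node, and the overall total.
    """
    pair, out, inc = Counter(), Counter(), Counter()
    for time, i, j in events:
        if time <= t and i != j:  # self-loops are excluded from the sums
            pair[(i, j)] += 1
            out[i] += 1  # contributes to the outgoing total of node i
            inc[j] += 1  # contributes to the incoming total of node j
    total = sum(pair.values())
    return pair, out, inc, total


# Toy event list (invented): (time slot, caller, callee).
events = [(1, "a", "b"), (2, "a", "c"), (3, "b", "a"), (5, "a", "b")]
pair, out, inc, total = marginal_counts(events, t=3)
print(out["a"], inc["a"], total)  # 2 1 3
```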

VAST data set 2008

The VAST data set 2008 is a synthetic data set created specifically for the VAST challenge contest in the year 2008. The data contains telephone records of a group of 400 individuals over a period of ten days. Some information about the data is given: it is assumed that among the group of people under observation, there are a few criminals. The main reason to use this data is that the challenge has been solved and

Comparison with relevant work

Several of the concepts reported in this paper have been inspired by Heard et al. (2010), who analysed network data similar to the VAST data set 2008. The main distinction between their work and ours is their use of Bayesian statistics (William, 2004, Bernardo and Smith, 1994) to infer the distribution of the communication data over a long period of time. Bayesian analysis requires an initial assumption on the probability of each trial.

In Heard et al. (2010), the number of calls made in

Conclusion

We have reported a methodology for discovering anomalies using statistical analysis. So far we have considered the VAST data set 2008 and a rather large Twitter dataset, with good results. The main problem that this analysis faces is the presence of false positives, and future research should be devoted to finding suitable ways to eliminate false positives as much as possible. To deal with false positives we envisage essentially two methods: to reduce the number of false positives or rank

Acknowledgment

We gratefully acknowledge Donal Simmie for having collected the Twitter data.

References (11)

  • T.A.B. Snijders et al. Introduction to actor-based models for network dynamics. Soc. Networks (2010)
  • J.M. Bernardo et al. Bayesian Theory (1994)
  • P.G. Harrison et al. Performance Modelling of Communication Networks and Computer Architectures (1992)
  • N.A. Heard et al. Bayesian anomaly detection methods for social networks. Ann. Appl. Stat. (2010)
  • M. McGlohon, L. Akoglu et al. Anomaly detection in large graphs. CMU-CS-09-173 (2009)
There are more references available in the full text version of this article.
