SharkFin: Spatio-temporal mining of software adoption and penetration

  • Original Article
  • Social Network Analysis and Mining

Abstract

How does malware propagate? Does it form spikes over time? Does it resemble the propagation pattern of benign files, such as software patches? Does it spread uniformly over countries? How long does it take for a URL that distributes malware to be detected and shut down? In this work, we answer these questions by analyzing patterns from 22 million malicious (and benign) files, found on 1.6 million hosts worldwide during the month of June 2011. We conduct this study using the WINE database available at Symantec Research Labs. Additionally, we explore the research questions raised by sampling on such large databases of executables; the importance of studying the implications of sampling is twofold: First, sampling is a means of reducing the size of the database hence making it more accessible to researchers; second, because every such data collection can be perceived as a sample of the real world. We discover the SharkFin temporal propagation pattern of executable files, the GeoSplit pattern in the geographical spread of machines that report executables to Symantec’s servers, the Periodic Power Law (Ppl) distribution of the lifetime of URLs, and we show how to efficiently extrapolate crucial properties of the data from a small sample. We further investigate the propagation pattern of benign and malicious executables, unveiling latent structures in the way these files spread. To the best of our knowledge, our work represents the largest study of propagation patterns of executables.

Notes

  1. Each month’s second Tuesday, on which Microsoft releases security patches.

References

  • Anderson RM, May RM (1982) Coevolution of hosts and parasites. Parasitology 85(02):411–426

  • Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512

  • Bilge L, Dumitras T (2012) Before we knew it: an empirical study of zero-day attacks in the real world. In: ACM conference on computer and communications security, Raleigh, NC, Oct 2012

  • Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. WWW9/Comput Netw 33(1–6):309–320

  • Caballero J, Grier C, Kreibich C, Paxson V (2011) Measuring pay-per-install: the commoditization of malware distribution. In: USENIX Security Symposium, USENIX Association

  • Camp J, Cranor L, Feamster N, Feigenbaum J, Forrest S, Kotz D, Lee W, Lincoln P, Paxson V, Reiter M, Rivest R, Sanders W, Savage S, Smith S, Spafford E, Stolfo S (2009) Data for cybersecurity research: process and “wish list”. http://www.gtisc.gatech.edu/files_nsf10/data-wishlist.pdf, June 2009

  • Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C (2011) Polonium: tera-scale graph mining and inference for malware detection. SIAM Int Conf Data Min, 2011

  • Crane R, Sornette D (2008) Robust dynamic classes revealed by measuring the response function of a social system. Proc Natl Acad Sci 105(41):15649–15653

  • Crovella M, Bestavros A (1996) Self-similarity in world wide web traffic, evidence and possible causes. Sigmetrics, pp 160–169

  • Dumitras T, Shou D (2011) Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE). In: EuroSys BADGERS workshop, Salzburg, Austria

  • Faloutsos C, Matias Y, Silberschatz A (1996) Modeling skewed distribution using multifractals and the ‘80-20’ law. Computer Science Department, p 547

  • Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. SIGCOMM, pp 251–262, Aug–Sept. 1999

  • Gkantsidis C, Karagiannis T, Vojnovic M (2006) Planet scale software updates. In: Rizzo L, Anderson TE, McKeown N (eds) SIGCOMM. ACM, New York, pp 423–434

  • Hethcote HW (2000) The mathematics of infectious diseases. SIAM Rev 42(4):599–653

  • Leskovec J, Backstrom L, Kumar R, Tomkins A (2008) Microscopic evolution of social networks. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 462–470

  • Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007) Cascading behavior in large blog graphs. arXiv preprint arXiv:0704.2803

  • Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q J Appl Math II(2):164–168

  • Maggs B (2012) Personal communication

  • Mandelbrot B (1977) Fractals: form, chance, and dimension, vol 1. W. H. Freeman, USA

  • Matsubara Y, Sakurai Y, Aditya Prakash B, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 6–14

  • Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N (2003) Inside the Slammer worm. Security & privacy. IEEE 1(4):33–39

  • Moore D, Shannon C, Claffy KC (2002) Code-red: a case study on the spread and victims of an internet worm. In: Internet measurement workshop. ACM, New York, pp 273–284

  • Newman MEJ (2005) Power laws, pareto distributions and zipf’s law. Contemp Phys 46

  • Papalexakis EE, Sidiropoulos ND (2011) Co-clustering as multilinear decomposition with sparse latent factors. In: 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, New York, pp 2064–2067

  • Papalexakis EE, Sidiropoulos ND, Bro R (2013) From k-means to higher-way co-clustering: multilinear decomposition with sparse latent factors. IEEE transactions on signal processing

  • Riedi RH, Crouse MS, Ribeiro VJ, Baraniuk RG (1999) A multifractal wavelet model with application to network traffic. In: IEEE Transactions on information theory, vol 3, April 1999

  • Schroeder M (1991) Fractals, chaos, power laws, 6 edn. W. H. Freeman, New York

  • Staniford S, Moore D, Paxson V, Weaver N (2004) The top speed of flash worms. In: Paxson V (ed) WORM. ACM Press, New York, pp 33–42

  • Staniford S, Paxson V, Weaver N (2002) How to own the internet in your spare time. In: Proceedings of the 11th USENIX Security Symposium. USENIX Association, Berkeley, pp 149–167

  • Symantec Corporation (2012) Symantec Internet security threat report, vol. 17. http://www.symantec.com/threatreport/ April 2012

  • Wang M, Ailamaki A, Faloutsos C (2002) Capturing the spatio-temporal behavior of real traffic data. Perform Eval 49(1/4):147–163

  • Wang M, Madhyastha T, Hang Chang N, Papadimitriou S, Faloutsos C (2002) Fast algorithms for modeling bursty traffic. ICDE, Data mining meets performance evaluation

  • Weaver N, Ellis D (2004) Reflections on Witty: analyzing the attacker. ;login: USENIX Mag 29(3):34–37

  • Zipf GK (1949) Human behavior and principle of least effort: an introduction to human ecology. Addison Wesley, Cambridge

Acknowledgments

We thank Vern Paxson and Marc Dacier for their early feedback on the design and effects of the WINE sampling strategy. The data analyzed in this paper are available for follow-on research as the reference data set WINE-2012-006. Research was also sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053.

Author information

Corresponding author

Correspondence to Evangelos E. Papalexakis.

Appendix

The SharkFin model

The SharkFin model of executable propagation is a generalization of the SpikeM model (Matsubara et al. 2012) for the spreading of memes through blogs. We briefly describe the model of Matsubara et al. (2012), adapting it to the task at hand.

The model assumes a total of \(N\) machines that can be infected. Let \(U(n)\) be the number of machines that are not infected at time \(n\); \(I(n)\) the number of machines infected up to time \(n-1\); and \(\Delta I(n)\) the number of machines infected exactly at time \(n\). Then \(U(n+1)=U(n) - \Delta I(n+1)\), with initial conditions \(\Delta I(0) = 0\) and \(U(0) = N\).

Additionally, we let \(\beta\) denote the strength of the executable file. We assume that the infectiveness of a file on a machine decays as a power law in the time \(\tau\) elapsed since the file infected that machine, i.e., \(f(\tau ) = \beta \tau ^{-1.5}\). The model has one more parameter, the “external shock”, i.e., the first appearance of a file: let \(n_b\) be the time at which this initial burst appears, and let \(S(n_b)\) be the size of the shock (count of infected machines).

Finally, to account for periodicity, we define a periodic function \(p(n)\) with three parameters: \(P_a\), the strength of the periodicity; \(P_p\), the period; and \(P_s\), the phase shift.

Putting it all together, our SharkFin model is

$$\begin{aligned} \Delta I(n+1) = p(n+1) \left( U(n) \sum _{t = n_b}^n \left( \Delta I(t) + S(t) \right) f(n+1 - t) + \epsilon \right) \end{aligned}$$

where \(p(n) = 1 - \frac{1}{2}P_a \left( \sin \left( \frac{2\pi }{P_p} \left( n+P_s\right) \right) \right) ,\) and \(\epsilon\) models external noise.
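As a minimal sketch, the recurrence above can be simulated directly. The function name `sharkfin` and its argument names are illustrative and do not come from the model description. The sketch also makes two assumptions that the text does not state: the shock \(S(t)\) is nonzero only at \(t = n_b\) (where it has size `S_b`), and each \(\Delta I(n+1)\) is clamped to \([0, U(n)]\) so that the uninfected count never becomes negative.

```python
import math


def sharkfin(N, beta, n_b, S_b, P_a, P_p, P_s, eps, T):
    """Simulate the SharkFin recurrence; returns [dI(0), ..., dI(T-1)].

    Assumptions (not stated in the text): S(t) = S_b at t = n_b and 0
    elsewhere; each new-infection count is clamped to [0, U(n)].
    """
    delta_I = [0.0] * T          # dI(0) = 0
    U = float(N)                 # U(0) = N
    for n in range(T - 1):
        # sum_{t = n_b}^{n} (dI(t) + S(t)) * f(n + 1 - t), f(tau) = beta * tau^-1.5
        total = 0.0
        for t in range(n_b, n + 1):
            shock = S_b if t == n_b else 0.0
            total += (delta_I[t] + shock) * beta * (n + 1 - t) ** -1.5
        # periodic modulation p(n + 1)
        p = 1.0 - 0.5 * P_a * math.sin(2.0 * math.pi / P_p * (n + 1 + P_s))
        new = p * (U * total + eps)
        new = min(max(new, 0.0), U)   # assumed clamp, see docstring
        delta_I[n + 1] = new
        U -= new                      # U(n+1) = U(n) - dI(n+1)
    return delta_I
```

For instance, `sharkfin(N=1000, beta=0.002, n_b=5, S_b=1.0, P_a=0.0, P_p=7, P_s=0, eps=0.0, T=60)` produces a series that is zero through time 5 and first becomes nonzero at time 6, one step after the shock.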

No-sampling version: If \(X(n), n=1\dots T\) is the sequence of file occurrences we want to model as a SharkFin spike, we want to minimize the following:

$$\begin{aligned} \min _{{ \varvec{\theta }}} \displaystyle {\sum _{n=1}^{T} \left( X(n) - \Delta I(n) \right) ^2 } \end{aligned}$$

where \({ \varvec{\theta }} = \left[\begin{array}{*{20}l} N&\beta&S_b&P_a&P_s \end{array}\right]^T\) is the vector of model parameters, with \(S_b = S(n_b)\) denoting the size of the external shock.

With sampling: If we are dealing with a sample of file occurrences, with sampling rate \(s,\) then we solve the problem:

$$\begin{aligned} \min _{{ \varvec{\theta }}} \displaystyle {\sum _{n=1}^{T} \left( sX(n) - \Delta I(n) \right) ^2 } \end{aligned}$$

In both cases, we use Levenberg–Marquardt (Levenberg 1944) to solve for the parameters of our SharkFin model.
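The fitting step might be sketched as follows, using SciPy's `least_squares` with `method="lm"`, which wraps MINPACK's Levenberg–Marquardt routine. This is an illustration under stated assumptions, not the authors' implementation: the names `sharkfin_series` and `fit_sharkfin` are hypothetical, the shock time `N_B` and period `P_P` are treated as known constants because they do not appear in \({\varvec{\theta}}\), the noise term \(\epsilon\) is set to zero, and \(S(t)\) is taken to be \(S_b\) at \(t = n_b\) and zero elsewhere. The residual follows the objective as written above, with `s=1.0` giving the no-sampling version.

```python
import numpy as np
from scipy.optimize import least_squares

N_B = 0   # assumed known time of the external shock (hypothetical choice)
P_P = 7   # assumed known period, e.g. weekly on daily bins (hypothetical)


def sharkfin_series(theta, T):
    """Return dI(1..T) for theta = [N, beta, S_b, P_a, P_s], with eps = 0."""
    N, beta, S_b, P_a, P_s = theta
    dI = np.zeros(T + 1)              # dI[n] holds dI(n); dI(0) = 0
    U = N                             # U(0) = N
    for n in range(T):
        t = np.arange(N_B, n + 1)
        shock = np.where(t == N_B, S_b, 0.0)   # assumed: S(t) = 0 for t != n_b
        lag = (n + 1 - t).astype(float)
        total = np.sum((dI[t] + shock) * beta * lag ** -1.5)
        p = 1.0 - 0.5 * P_a * np.sin(2.0 * np.pi / P_P * (n + 1 + P_s))
        dI[n + 1] = p * U * total
        U = U - dI[n + 1]             # U(n+1) = U(n) - dI(n+1)
    return dI[1:]


def fit_sharkfin(X, theta0, s=1.0):
    """Least-squares fit of theta to counts X(1..T), starting from theta0."""
    X = np.asarray(X, dtype=float)

    def residuals(theta):
        # residual of the objective as written: s * X(n) - dI(n)
        return s * X - sharkfin_series(theta, len(X))

    # x_scale="jac" lets the solver rescale parameters of very different size
    result = least_squares(residuals, theta0, method="lm", x_scale="jac")
    return result.x
```

Levenberg–Marquardt is a local method, so the result depends on the starting point `theta0`; in practice one would likely try several starting points and keep the fit with the smallest residual.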

Cite this article

Papalexakis, E.E., Dumitras, T., Chau, D.H. et al. SharkFin: Spatio-temporal mining of software adoption and penetration. Soc. Netw. Anal. Min. 4, 240 (2014). https://doi.org/10.1007/s13278-014-0240-2

  • DOI: https://doi.org/10.1007/s13278-014-0240-2
