Abstract
How does malware propagate? Does it form spikes over time? Does it resemble the propagation pattern of benign files, such as software patches? Does it spread uniformly over countries? How long does it take for a URL that distributes malware to be detected and shut down? In this work, we answer these questions by analyzing patterns from 22 million malicious (and benign) files, found on 1.6 million hosts worldwide during the month of June 2011. We conduct this study using the WINE database available at Symantec Research Labs. Additionally, we explore the research questions raised by sampling on such large databases of executables. Studying the implications of sampling is important for two reasons: first, sampling reduces the size of the database, making it more accessible to researchers; second, every such data collection can itself be perceived as a sample of the real world. We discover the SharkFin temporal propagation pattern of executable files, the GeoSplit pattern in the geographical spread of machines that report executables to Symantec’s servers, and the Periodic Power Law (Ppl) distribution of the lifetime of URLs, and we show how to efficiently extrapolate crucial properties of the data from a small sample. We further investigate the propagation pattern of benign and malicious executables, unveiling latent structures in the way these files spread. To the best of our knowledge, our work represents the largest study of propagation patterns of executables.
Notes
Each month’s second Tuesday, on which Microsoft releases security patches.
References
Anderson RM, May RM (1982) Coevolution of hosts and parasites. Parasitology 85(02):411–426
Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Bilge L, Dumitras T (2012) Before we knew it: an empirical study of zero-day attacks in the real world. In: ACM conference on computer and communications security, Raleigh, NC, Oct 2012
Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. WWW9/Comput Netw 33(1–6):309–320
Caballero J, Grier C, Kreibich C, Paxson V (2011) Measuring pay-per-install: the commoditization of malware distribution. In: USENIX Security Symposium, USENIX Association
Camp J, Cranor L, Feamster N, Feigenbaum J, Forrest S, Kotz D, Lee W, Lincoln P, Paxson V, Reiter M, Rivest R, Sanders W, Savage S, Smith S, Spafford E, Stolfo S (2009) Data for cybersecurity research: process and “wish list”. http://www.gtisc.gatech.edu/files_nsf10/data-wishlist.pdf, June 2009
Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C (2011) Polonium: tera-scale graph mining and inference for malware detection. SIAM Int Conf Data Min, 2011
Crane R, Sornette D (2008) Robust dynamic classes revealed by measuring the response function of a social system. Proc Natl Acad Sci 105(41):15649–15653
Crovella M, Bestavros A (1996) Self-similarity in world wide web traffic, evidence and possible causes. Sigmetrics, pp 160–169
Dumitras T, Shou D (2011) Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE). In: EuroSys BADGERS workshop, Salzburg, Austria
Faloutsos C, Matias Y, Silberschatz A (1996) Modeling skewed distribution using multifractals and the ‘80-20’ law. Computer Science Department, p 547
Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. SIGCOMM, pp 251–262, Aug–Sept. 1999
Gkantsidis C, Karagiannis T, Vojnovic M (2006) Planet scale software updates. In: Rizzo L, Anderson TE, McKeown N (eds) SIGCOMM. ACM, New York, pp 423–434
Hethcote HW (2000) The mathematics of infectious diseases. SIAM Rev 42(4):599–653
Leskovec J, Backstrom L, Kumar R, Tomkins A (2008) Microscopic evolution of social networks. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 462–470
Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007) Cascading behavior in large blog graphs. arXiv preprint arXiv:0704.2803
Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q J Appl Math II(2):164–168
Maggs B (2012) Personal communication
Mandelbrot B (1977) Fractals: form, chance, and dimension, vol 1. Freeman, W. H, USA
Matsubara Y, Sakurai Y, Aditya Prakash B, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 6–14
Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N (2003) Inside the Slammer worm. Security & privacy. IEEE 1(4):33–39
Moore D, Shannon C, Claffy KC (2002) Code-red: a case study on the spread and victims of an internet worm. In: Internet measurement workshop. ACM, New York, pp 273–284
Newman MEJ (2005) Power laws, pareto distributions and zipf’s law. Contemp Phys 46
Papalexakis EE, Sidiropoulos ND (2011) Co-clustering as multilinear decomposition with sparse latent factors. In: 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, New York, pp 2064–2067
Papalexakis EE, Sidiropoulos ND, Bro R (2013) From k-means to higher-way co-clustering: multilinear decomposition with sparse latent factors. IEEE transactions on signal processing
Riedi RH, Crouse MS, Ribeiro VJ, Baraniuk RG (1999) A multifractal wavelet model with application to network traffic. In: IEEE Transactions on information theory, vol 3, April 1999
Schroeder M (1991) Fractals, chaos, power laws, 6 edn. W. H. Freeman, New York
Staniford S, Moore D, Paxson V, Weaver N (2004) The top speed of flash worms. In: Paxson V (ed) WORM. ACM Press, New York, pp 33–42
Staniford S, Paxson V, Weaver N (2002) How to own the internet in your spare time. In: Proceedings of the 11th USENIX Security Symposium. USENIX Association, Berkeley, pp 149–167
Symantec Corporation (2012) Symantec Internet security threat report, vol. 17. http://www.symantec.com/threatreport/ April 2012
Wang M, Ailamaki A, Faloutsos C (2002) Capturing the spatio-temporal behavior of real traffic data. Perform Eval 49(1/4):147–163
Wang M, Madhyastha T, Hang Chang N, Papadimitriou S, Faloutsos C (2002) Fast algorithms for modeling bursty traffic. ICDE, Data mining meets performance evaluation
Weaver N, Ellis D (2004) Reflections on Witty: analyzing the attacker; login. USENIX Mag 29(3):34–37
Zipf GK (1949) Human behavior and principle of least effort: an introduction to human ecology. Addison Wesley, Cambridge
Acknowledgments
We thank Vern Paxson and Marc Dacier for their early feedback on the design and effects of the WINE sampling strategy. The data analyzed in this paper are available for follow-on research as the reference data set WINE-2012-006. Research was also sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053.
Appendix
The SharkFin model
The SharkFin model of executable propagation is a generalization of the SpikeM model (Matsubara et al. 2012) for the spreading of memes through blogs. We briefly describe the model of Matsubara et al. (2012), adapting it to the task at hand.
The model assumes a total number of \(N\) machines that can be infected. Let \(U(n)\) be the number of machines that are not infected at time \(n;\) \(I(n)\) be the count of machines that got infected up to time \(n-1;\) and \(\Delta I(n)\) be the count of machines infected exactly at time \(n.\) Then, \(U(n+1)=U(n) - \Delta I(n+1)\) with initial conditions \(\Delta I(0) = 0\) and \(U(0) = N\).
Additionally, we let \(\beta\) denote the strength of the executable file. We assume that the infectiveness of a file on a machine decays as a power law in the time elapsed since the file infected that machine (say \(\tau\)), i.e., \(f(\tau ) = \beta \tau ^{-1.5}.\) We must also consider one more parameter for our model: the “external shock”, that is, the first appearance of a file. Let \(n_b\) be the time at which this initial burst appears, and let \(S(n_b)\) be the size of the shock (the count of infected machines).
Finally, to account for periodicity, we define a periodic function \(p(n)\) with three parameters: \(P_a,\) the strength of the periodicity; \(P_p,\) the period; and \(P_s,\) the phase shift.
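Putting it all together, our SharkFin model is
\[\Delta I(n+1) = p(n+1) \left( U(n) \sum _{t=n_b}^{n} \left( \Delta I(t) + S(t) \right) f(n+1-t) + \epsilon \right) ,\]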
where \(p(n) = 1 - \frac{1}{2}P_a \left( \sin \left( \frac{2\pi }{P_p} \left( n+P_s\right) \right) \right) ,\) and \(\epsilon\) models external noise.
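The recursion can be illustrated with a minimal simulation sketch. It assumes the update rule takes the SpikeM form of Matsubara et al. (2012), namely \(\Delta I(n+1) = p(n+1) \left( U(n) \sum _{t=n_b}^{n} (\Delta I(t) + S(t)) f(n+1-t) + \epsilon \right)\), with \(S(t)\) nonzero only at \(t = n_b\). The function names, the constant noise term `eps`, and the clamping of new infections to \([0, U(n)]\) are illustrative assumptions, not part of the original model description.

```python
import math

def periodicity(n, p_a, p_p, p_s):
    """p(n) = 1 - 0.5 * P_a * sin(2*pi / P_p * (n + P_s))."""
    return 1.0 - 0.5 * p_a * math.sin(2.0 * math.pi / p_p * (n + p_s))

def decay(tau, beta):
    """Infectiveness f(tau) = beta * tau^(-1.5), used for tau >= 1."""
    return beta * tau ** -1.5

def simulate_sharkfin(N, beta, n_b, S_b, p_a, p_p, p_s, T, eps=0.0):
    """Return [Delta I(0), ..., Delta I(T)] under the assumed update rule."""
    delta_i = [0.0] * (T + 1)   # Delta I(0) = 0
    uninfected = float(N)       # U(0) = N
    for n in range(T):
        # Past infections (plus the external shock at n_b), each weighted
        # by the power-law decay of the time elapsed since it occurred.
        pressure = 0.0
        for t in range(n_b, n + 1):
            shock = S_b if t == n_b else 0.0
            pressure += (delta_i[t] + shock) * decay(n + 1 - t, beta)
        new = periodicity(n + 1, p_a, p_p, p_s) * (uninfected * pressure + eps)
        # Safeguard added for this sketch: keep the count within [0, U(n)].
        new = min(max(new, 0.0), uninfected)
        delta_i[n + 1] = new
        uninfected -= new       # U(n+1) = U(n) - Delta I(n+1)
    return delta_i

# Hypothetical parameter values, chosen only to produce a rise-and-fall curve.
series = simulate_sharkfin(N=1000, beta=0.001, n_b=0, S_b=10,
                           p_a=0.0, p_p=7, p_s=0, T=30)
```

Setting `p_a=0.0` switches off the periodic modulation, so the curve shows only the shock-driven rise and the decay that follows as the pool of uninfected machines shrinks.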
No-sampling version: If \(X(n), n=1\dots T\) is the sequence of file occurrences we want to model as a SharkFin spike, we want to minimize the following:
\[\min _{{ \varvec{\theta }}} \sum _{n=1}^{T} \left( X(n) - \Delta I(n) \right) ^2 ,\]
where \({ \varvec{\theta }} = \left[\begin{array}{*{20}l} N&\beta&S_b&P_a&P_s \end{array}\right]^T\) is the vector of model parameters.
With sampling: If we are dealing with a sample of file occurrences, with sampling rate \(s,\) then we solve the problem:
\[\min _{{ \varvec{\theta }}} \sum _{n=1}^{T} \left( X(n) - s\, \Delta I(n) \right) ^2 .\]
In both cases, we use Levenberg–Marquardt (Levenberg 1944) to solve for the parameters of our SharkFin model.
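A minimal sketch of the two fitting objectives follows. It assumes both are sums of squared errors between the observed series \(X(n)\) and the model series \(\Delta I(n)\), with the model scaled by the sampling rate \(s\) in the sampled case. The names `residuals` and `sse`, the `simulate` callable, and the one-parameter `toy_model` are hypothetical stand-ins for illustration; `toy_model` is not the SharkFin model.

```python
def residuals(theta, observed, simulate, s=1.0):
    """Per-step errors X(n) - s * Delta I(n); s = 1.0 means no sampling."""
    model = simulate(theta)
    return [x - s * m for x, m in zip(observed, model)]

def sse(theta, observed, simulate, s=1.0):
    """Sum of squared residuals, the quantity to be minimized over theta."""
    return sum(r * r for r in residuals(theta, observed, simulate, s))

def toy_model(theta):
    """Hypothetical stand-in for a simulator: a spike decaying as n^(-1.5)."""
    height = theta[0]
    return [height * (n + 1) ** -1.5 for n in range(5)]

# A 10% sample of a spike of height 100, thinned deterministically here.
full = toy_model([100.0])
observed = [0.1 * v for v in full]

with_rate = sse([100.0], observed, toy_model, s=0.1)     # true height, s known
without_rate = sse([100.0], observed, toy_model, s=1.0)  # true height, s ignored
```

In this toy setting the true height gives zero error only when the sampling rate is supplied; if \(s\) is ignored, the best-fitting height is ten times too small. A residual function of this shape could be handed to a Levenberg–Marquardt routine, for example `scipy.optimize.least_squares` with `method='lm'`.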
Cite this article
Papalexakis, E.E., Dumitras, T., Chau, D.H. et al. SharkFin: Spatio-temporal mining of software adoption and penetration. Soc. Netw. Anal. Min. 4, 240 (2014). https://doi.org/10.1007/s13278-014-0240-2
DOI: https://doi.org/10.1007/s13278-014-0240-2