SharkFin: Spatio-temporal mining of software adoption and penetration

Papalexakis, Evangelos E.; Dumitras, Tudor; Chau, Duen Horng; Prakash, B. Aditya; Faloutsos, Christos

doi:10.1007/s13278-014-0240-2

Original Article
Published: 27 November 2014

Volume 4, article number 240 (2014)

Social Network Analysis and Mining

Evangelos E. Papalexakis¹,
Tudor Dumitras²,
Duen Horng Chau³,
B. Aditya Prakash⁴ &
…
Christos Faloutsos¹

Abstract

How does malware propagate? Does it form spikes over time? Does it resemble the propagation pattern of benign files, such as software patches? Does it spread uniformly over countries? How long does it take for a URL that distributes malware to be detected and shut down? In this work, we answer these questions by analyzing patterns from 22 million malicious (and benign) files, found on 1.6 million hosts worldwide during the month of June 2011. We conduct this study using the WINE database available at Symantec Research Labs. Additionally, we explore the research questions raised by sampling on such large databases of executables; the importance of studying the implications of sampling is twofold: First, sampling is a means of reducing the size of the database hence making it more accessible to researchers; second, because every such data collection can be perceived as a sample of the real world. We discover the SharkFin temporal propagation pattern of executable files, the GeoSplit pattern in the geographical spread of machines that report executables to Symantec’s servers, the Periodic Power Law (Ppl) distribution of the lifetime of URLs, and we show how to efficiently extrapolate crucial properties of the data from a small sample. We further investigate the propagation pattern of benign and malicious executables, unveiling latent structures in the way these files spread. To the best of our knowledge, our work represents the largest study of propagation patterns of executables.

Notes

Each month’s second Tuesday, on which Microsoft releases security patches.

References

Anderson RM, May RM (1982) Coevolution of hosts and parasites. Parasitology 85(02):411–426
Barabasi AL, Albert R (1999) Emergence of scaling in random networks. Science 286(5439):509–512
Bilge L, Dumitras T (2012) Before we knew it: an empirical study of zero-day attacks in the real world. In: ACM conference on computer and communications security, Raleigh, NC, Oct 2012
Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, Stata R, Tomkins A, Wiener J (2000) Graph structure in the web. WWW9/Comput Netw 33(1–6):309–320
Caballero J, Grier C, Kreibich C, Paxson V (2011) Measuring pay-per-install: the commoditization of malware distribution. In: USENIX Security Symposium, USENIX Association
Camp J, Cranor L, Feamster N, Feigenbaum J, Forrest S, Kotz D, Lee W, Lincoln P, Paxson V, Reiter M, Rivest R, Sanders W, Savage S, Smith S, Spafford E, Stolfo S (2009) Data for cybersecurity research: process and “wish list”. http://www.gtisc.gatech.edu/files_nsf10/data-wishlist.pdf, June 2009
Chau DH, Nachenberg C, Wilhelm J, Wright A, Faloutsos C (2011) Polonium: tera-scale graph mining and inference for malware detection. SIAM Int Conf Data Min, 2011
Crane R, Sornette D (2008) Robust dynamic classes revealed by measuring the response function of a social system. Proc Natl Acad Sci 105(41):15649–15653
Crovella M, Bestavros A (1996) Self-similarity in world wide web traffic, evidence and possible causes. Sigmetrics, pp 160–169
Dumitras T, Shou D (2011) Toward a standard benchmark for computer security research: the worldwide intelligence network environment (WINE). In: EuroSys BADGERS workshop, Salzburg, Austria
Faloutsos C, Matias Y, Silberschatz A (1996) Modeling skewed distribution using multifractals and the ‘80-20’ law. Computer Science Department, p 547
Faloutsos M, Faloutsos P, Faloutsos C (1999) On power-law relationships of the internet topology. SIGCOMM, pp 251–262, Aug–Sept. 1999
Gkantsidis C, Karagiannis T, Vojnovic M (2006) Planet scale software updates. In: Rizzo L, Anderson TE, McKeown N (eds) SIGCOMM. ACM, New York, pp 423–434
Hethcote HW (2000) The mathematics of infectious diseases. SIAM Rev 42(4):599–653
Leskovec J, Backstrom L, Kumar R, Tomkins A (2008) Microscopic evolution of social networks. In: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 462–470
Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007) Cascading behavior in large blog graphs. arXiv preprint arXiv:0704.2803
Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q J Appl Math II(2):164–168
Maggs B (2012) Personal communication
Mandelbrot B (1977) Fractals: form, chance, and dimension, vol 1. Freeman, W. H, USA
Matsubara Y, Sakurai Y, Aditya Prakash B, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 6–14
Moore D, Paxson V, Savage S, Shannon C, Staniford S, Weaver N (2003) Inside the Slammer worm. Security & privacy. IEEE 1(4):33–39
Moore D, Shannon C, Claffy KC (2002) Code-red: a case study on the spread and victims of an internet worm. In: Internet measurement workshop. ACM, New York, pp 273–284
Newman MEJ (2005) Power laws, pareto distributions and zipf’s law. Contemp Phys 46
Papalexakis EE, Sidiropoulos ND (2011) Co-clustering as multilinear decomposition with sparse latent factors. In: 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, New York, pp 2064–2067
Papalexakis EE, Sidiropoulos ND, Bro R (2013) From k-means to higher-way co-clustering: multilinear decomposition with sparse latent factors. IEEE transactions on signal processing
Riedi RH, Crouse MS, Ribeiro VJ, Baraniuk RG (1999) A multifractal wavelet model with application to network traffic. In: IEEE Transactions on information theory, vol 3, April 1999
Schroeder M (1991) Fractals, chaos, power laws, 6 edn. W. H. Freeman, New York
Staniford S, Moore D, Paxson V, Weaver N (2004) The top speed of flash worms. In: Paxson V (ed) WORM. ACM Press, New York, pp 33–42
Staniford S, Paxson V, Weaver N (2002) How to own the internet in your spare time. In: Proceedings of the 11th USENIX Security Symposium. USENIX Association, Berkeley, pp 149–167
Symantec Corporation (2012) Symantec Internet security threat report, vol. 17. http://www.symantec.com/threatreport/ April 2012
Wang M, Ailamaki A, Faloutsos C (2002) Capturing the spatio-temporal behavior of real traffic data. Perform Eval 49(1/4):147–163
Wang M, Madhyastha T, Hang Chang N, Papadimitriou S, Faloutsos C (2002) Fast algorithms for modeling bursty traffic. ICDE, Data mining meets performance evaluation
Weaver N, Ellis D (2004) Reflections on Witty: analyzing the attacker. ;login: USENIX Mag 29(3):34–37
Zipf GK (1949) Human behavior and principle of least effort: an introduction to human ecology. Addison Wesley, Cambridge

Acknowledgments

We thank Vern Paxson and Marc Dacier for their early feedback on the design and effects of the WINE sampling strategy. The data analyzed in this paper are available for follow-on research as the reference data set WINE-2012-006. Research was also sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053.

Author information

Authors and Affiliations

School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Evangelos E. Papalexakis & Christos Faloutsos
Department of ECE, University of Maryland, College Park, USA
Tudor Dumitras
School of Computational Science and Engineering, Georgia Tech, Atlanta, USA
Duen Horng Chau
Computer Science Department, Virginia Tech, Blacksburg, USA
B. Aditya Prakash

Corresponding author

Correspondence to Evangelos E. Papalexakis.

Appendix

The SharkFin model

The SharkFin model of executable propagation is a generalization of the SpikeM model (Matsubara et al. 2012) for the spreading of memes through blogs. We briefly describe the model of Matsubara et al. (2012), adapting it to the task at hand.

The model assumes a total of $N$ machines that can be infected. Let $U(n)$ be the number of machines that are not infected at time $n$; $I(n)$ the count of machines infected up to time $n-1$; and $\Delta I(n)$ the count of machines infected exactly at time $n$. Then $U(n+1)=U(n) - \Delta I(n+1)$, with initial conditions $\Delta I(0) = 0$ and $U(0) = N$.

Additionally, we let $\beta$ be the strength of the executable file. We assume that the infectiveness of a file on a machine drops as a specific power law of the time $\tau$ elapsed since the file infected that machine, i.e. $f(\tau ) = \beta \tau ^{-1.5}.$ We must also consider one more parameter for our model: the “external shock”, or in other words, the first appearance of a file. Let $n_b$ be the time at which this initial burst appeared, and let $S(n_b)$ be the size of the shock (count of infected machines).
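As a quick numerical illustration of how fast this decay acts (the sample values of $\tau$ are chosen here only for convenience and do not come from the data), evaluating $f$ at a few elapsed times gives

$$\begin{aligned} f(1) = \beta , \qquad f(4) = \beta \cdot 4^{-1.5} = \frac{\beta }{8}, \qquad f(100) = \beta \cdot 100^{-1.5} = \frac{\beta }{1000}. \end{aligned}$$

So a machine infected four time ticks ago contributes one eighth of the infective force of a machine infected one tick ago, while the heavy power-law tail keeps old infections from vanishing entirely.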

Finally, to account for periodicity, we define a periodic function $p(n)$ with three parameters: $P_a$, the strength of the periodicity; $P_p$, the period; and $P_s$, the phase shift.

Putting it all together, our SharkFin model is

$$\begin{aligned} \Delta I(n+1) = p(n+1) \left( U(n) \sum _{t = n_b}^n \left( \Delta I(t) + S(t) \right) f(n+1 - t) + \epsilon \right) \end{aligned}$$

where $p(n) = 1 - \frac{1}{2}P_a \left( \sin \left( \frac{2\pi }{P_p} \left( n+P_s\right) \right) \right) ,$ and $\epsilon$ models external noise.
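The recurrence above can be simulated directly. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sharkfin`, its argument names, and the use of NumPy are illustrative choices, the shock $S(t)$ is taken to be zero everywhere except at $t = n_b$, and each step is clamped to the range $[0, U(n)]$ so that the count of uninfected machines never goes negative (the clamp is our addition and is not stated in the model).

```python
import numpy as np

def sharkfin(T, N, beta, n_b, S_b, P_a, P_p, P_s, eps=0.0):
    """Simulate Delta I(n) and U(n) for n = 0..T-1 (illustrative sketch)."""
    n = np.arange(T)
    # Periodic modulation p(n) = 1 - 0.5 * P_a * sin(2*pi/P_p * (n + P_s)).
    p = 1.0 - 0.5 * P_a * np.sin(2.0 * np.pi / P_p * (n + P_s))
    # Decay kernel f(tau) = beta * tau^(-1.5), defined for tau >= 1.
    f = np.zeros(T)
    f[1:] = beta * np.arange(1, T, dtype=float) ** -1.5
    # External shock: assumed to be a single impulse of size S_b at n_b.
    S = np.zeros(T)
    S[n_b] = S_b
    dI = np.zeros(T)   # Delta I(0) = 0
    U = np.zeros(T)
    U[0] = N           # U(0) = N
    for k in range(T - 1):
        t = np.arange(n_b, k + 1)  # empty while k < n_b
        force = np.sum((dI[t] + S[t]) * f[k + 1 - t])
        new = p[k + 1] * (U[k] * force + eps)
        # Assumption (not in the source): keep 0 <= Delta I(k+1) <= U(k).
        dI[k + 1] = min(max(new, 0.0), U[k])
        U[k + 1] = U[k] - dI[k + 1]
    return dI, U

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to produce a visible spike.
    dI, U = sharkfin(T=60, N=1000.0, beta=1e-3, n_b=3, S_b=5.0,
                     P_a=0.4, P_p=7.0, P_s=0.0)
    print(np.round(dI[:15], 2))
```

The inner sum is recomputed at every step, which costs $O(T^2)$ overall; that is adequate for a one-month series but would need a faster convolution for much longer ones.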

No-sampling version: If $X(n), n=1\dots T$ is the sequence of file occurrences we want to model as a SharkFin spike, we want to minimize the following:

$$\begin{aligned} \min _{{ \varvec{\theta }}} \displaystyle {\sum _{n=1}^{T} \left( X(n) - \Delta I(n) \right) ^2 } \end{aligned}$$

where ${ \varvec{\theta }} = \begin{bmatrix} N&\beta&S_b&P_a&P_s \end{bmatrix}^T$ is the vector of model parameters, with $S_b$ denoting the shock size $S(n_b)$.

With sampling: If we are dealing with a sample of file occurrences, with sampling rate $s,$ then we solve the problem:

$$\begin{aligned} \min _{{ \varvec{\theta }}} \displaystyle {\sum _{n=1}^{T} \left( sX(n) - \Delta I(n) \right) ^2 } \end{aligned}$$

In both cases, we use Levenberg–Marquardt (Levenberg 1944) to solve for the parameters of our SharkFin model.
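Both objectives can be handed to an off-the-shelf Levenberg–Marquardt solver. The sketch below is one possible setup, not the authors' code: it uses SciPy's `least_squares` with `method="lm"`, fits the five parameters listed in ${ \varvec{\theta }}$, and treats $n_b$ and $P_p$ as known constants (an assumption, since the source does not say how they are chosen). The helper names `delta_I`, `residuals`, and `fit_sharkfin` are hypothetical. Because this solver does not support bounds, the sketch takes absolute values of $N$, $\beta$, and $S_b$ and clamps each step to $[0, U(n)]$; both are our additions. Setting `s=1.0` gives the no-sampling objective, and any other `s` gives the residual $sX(n) - \Delta I(n)$ exactly as written above.

```python
import numpy as np
from scipy.optimize import least_squares

def delta_I(theta, T, n_b, P_p, eps=0.0):
    """Model output Delta I(0..T-1) for theta = [N, beta, S_b, P_a, P_s]."""
    # Assumption: abs() keeps N, beta, S_b non-negative without bounds.
    N, beta, S_b = np.abs(theta[:3])
    P_a, P_s = theta[3], theta[4]
    n = np.arange(T)
    p = 1.0 - 0.5 * P_a * np.sin(2.0 * np.pi / P_p * (n + P_s))
    f = np.zeros(T)
    f[1:] = beta * np.arange(1, T, dtype=float) ** -1.5
    S = np.zeros(T)
    S[n_b] = S_b
    dI = np.zeros(T)
    U = N
    for k in range(T - 1):
        t = np.arange(n_b, k + 1)
        force = np.sum((dI[t] + S[t]) * f[k + 1 - t])
        # Assumption: clamp each step to [0, U], as the model needs U >= 0.
        dI[k + 1] = min(max(p[k + 1] * (U * force + eps), 0.0), U)
        U -= dI[k + 1]
    return dI

def residuals(theta, X, s, n_b, P_p):
    """Residual vector s * X(n) - Delta I(n); s = 1 means no sampling."""
    return s * X - delta_I(theta, len(X), n_b, P_p)

def fit_sharkfin(X, theta0, s=1.0, n_b=0, P_p=7.0):
    """Minimize the sum of squared residuals with Levenberg-Marquardt."""
    return least_squares(residuals, theta0, args=(X, s, n_b, P_p), method="lm")

if __name__ == "__main__":
    # Synthetic check with hypothetical parameter values.
    theta_true = np.array([1000.0, 1e-3, 5.0, 0.4, 1.0])
    X = delta_I(theta_true, T=40, n_b=2, P_p=7.0)
    theta0 = np.array([800.0, 2e-3, 3.0, 0.2, 0.0])
    fit = fit_sharkfin(X, theta0, s=1.0, n_b=2, P_p=7.0)
    print(fit.x, fit.cost)
```

The parameters differ by several orders of magnitude ($N$ versus $\beta$), so in practice the starting point matters and rescaling the variables may help convergence; the fit is not guaranteed to recover the generating parameters from an arbitrary `theta0`.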

About this article

Cite this article

Papalexakis, E.E., Dumitras, T., Chau, D.H. et al. SharkFin: Spatio-temporal mining of software adoption and penetration. Soc. Netw. Anal. Min. 4, 240 (2014). https://doi.org/10.1007/s13278-014-0240-2

Received: 17 December 2013
Revised: 13 November 2014
Accepted: 17 November 2014
Published: 27 November 2014
DOI: https://doi.org/10.1007/s13278-014-0240-2
