Efficiently spotting the starting points of an epidemic in a large graph

  • Regular paper
  • Knowledge and Information Systems

Abstract

Given a snapshot of a large graph in which an infection has been spreading for some time, can we identify the nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method called NetSleuth for the well-known susceptible-infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments show that our method detects seed nodes with high accuracy and automatically identifies their correct number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.

Notes

  1. When not interested in the actual ripple \(R\), one could encode \(G_I\) by its overall probability starting from \(\mathcal{S }\). Obtaining this probability, however, is very expensive, even by MCMC sampling. As we will see in Sects. 5 and 6, computing a good ripple is both cheap and gives good results.

  2. For more information, see http://topology.eecs.umich.edu/data.html.

  3. We use the standard definition of the Jaccard distance between two sets \(\mathcal A \) and \(\mathcal B \), namely \(1 - \frac{|\mathcal A \cap \mathcal B |}{|\mathcal A \cup \mathcal B |}\).
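
The Jaccard distance in Note 3 can be sketched in a few lines of Python. This is an illustration only: the function name and the example seed sets are hypothetical, and treating two empty sets as having distance 0 is an assumed convention, not something the paper specifies.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are identical (distance 0).
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: true seed nodes versus recovered seed nodes.
true_seeds = {1, 2, 3}
found_seeds = {2, 3, 4}
print(jaccard_distance(true_seeds, found_seeds))  # 1 - 2/4 = 0.5
```

A distance of 0 means the two sets coincide, and a distance of 1 means they are disjoint.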

References

  1. Anderson RM, May RM (1991) Infectious diseases of humans: dynamics and control. Oxford University Press, Oxford

  2. Bikhchandani S, Hirshleifer D, Welch I (1992) A theory of fads, fashion, custom, and cultural change as informational cascades. J Polit Econ 100(5):992–1026

  3. Briesemeister L, Lincoln P, Porras P (2003) Epidemic profiles and defense of scale-free networks. In: WORM 2003, Washington, DC

  4. Cilibrasi R, Vitányi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545

  5. Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, New York, pp 110–112

  6. Cvetković DM, Doob M, Sachs H (1998) Spectra of graphs: theory and applications, 3rd edn

  7. Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 79–88

  8. Chakrabarti D, Wang Y, Wang C, Leskovec J, Faloutsos C (2008) Epidemic thresholds in real networks. TISSEC 10(4)

  9. Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’10). ACM, New York, pp 1029–1038. doi:10.1145/1835804.1835934. http://doi.acm.org/10.1145/1835804.1835934

  10. Faloutsos C, Megalooikonomou V (2007) On data mining, compression and Kolmogorov complexity. In: Webb G (ed) Data mining and knowledge discovery, vol 15. Springer, Berlin, pp 3–20

  11. Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  12. Goldenberg J, Libai B, Muller E (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Market Lett 12(3):211–223

  13. Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the 13th international conference on World Wide Web (WWW)

  14. Ganesh A, Massoulié L, Towsley D (2005) The effect of network topology on the spread of epidemics. In: INFOCOM

  15. Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada

  16. Kephart JO, White SR (1993) Measuring and modeling computer virus prevalence. In: SP

  17. Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the 9th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC

  18. Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, Berlin

  19. Leskovec J, Adamic LA, Huberman BA (2006) The dynamics of viral marketing. In: EC

  20. Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance NS (2007a) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 420–429

  21. Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007b) Cascading behavior in large blog graphs: patterns and a model. In: Proceedings of the 7th SIAM international conference on data mining (SDM), Minneapolis, MN

  22. Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC, pp 1059–1068

  23. MacCluer CR (2000) The many proofs and applications of Perron’s theorem. SIAM Rev 42(3):487–498

  24. Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86(14)

  25. Prakash BA, Tong H, Valler N, Faloutsos M, Faloutsos C (2010) Virus propagation on time-varying networks: theory and immunization algorithms. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain

  26. Prakash BA, Chakrabarti D, Faloutsos M, Valler N, Faloutsos C (2011) Threshold conditions for arbitrary cascade models on arbitrary networks. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada

  27. Prakash BA, Chakrabarti D, Valler N, Faloutsos M, Faloutsos C (2012) Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl Inf Syst 33(3):549–575

  28. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471

  29. Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431

  30. Richardson M, Domingos P (2002) Mining knowledge-sharing sites for viral marketing. In: Proceedings of the 8th ACM international conference on knowledge discovery and data mining (SIGKDD), Edmonton, Alberta

  31. Roos T, Rissanen J (2008) On sequentially normalized maximum likelihood models. In: Proceedings of the workshop on information theoretic methods in science and engineering (WITMSE)

  32. Saito K, Kimura M, Ohara K, Motoda H (2012) Efficient discovery of influential nodes for SIS models in social networks. Knowl Inf Syst 30(3):613–635

  33. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  34. Strang G (1988) Linear algebra and its applications, 3rd edn. Harcourt Brace Jovanovich, San Diego

  35. Shah D, Zaman T (2010) Detecting sources of computer viruses in networks: theory and experiment. In: SIGMETRICS, pp 203–214

  36. Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57(8):5163–5181

  37. Smets K, Vreeken J (2011) The odd one out: identifying and characterising anomalies. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ. Society for Industrial and Applied Mathematics (SIAM), pp 804–815

  38. Tong H, Prakash BA, Tsourakakis CE, Eliassi-Rad T, Faloutsos C, Chau DH (2010) On the vulnerability of large graphs. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), Sydney, Australia

  39. Vereshchagin N, Vitányi P (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290

  40. Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214

  41. Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, Berlin

  42. Zhao J, Wu J, Feng X, Xiong H, Xu K (2011) Information propagation in online social networks: a tie-strength perspective. Knowl Inf Syst 1–20. doi:10.1007/s10115-011-0445-x

Acknowledgments

This material is based upon work supported by the Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and the National Science Foundation under Grant No. IIS-1017415. Jilles Vreeken is supported by a Postdoctoral Fellowship of the Research Foundation—Flanders (FWO).

Author information

Corresponding author

Correspondence to B. Aditya Prakash.

Appendix 1

1.1 Proofs

In this section, we formally prove the lemmas used in the paper.

Proof of Lemma 1

The graph \(G_I\) is connected: we assume that the infected nodes form a connected subgraph, since otherwise we are dealing with separate problems, one for each connected component. Now, the Laplacian matrix \(L(G)\) has nonpositive off-diagonal entries and positive diagonal entries. In addition, \(L(G)\) is a symmetric matrix (as \(A(G)\) is symmetric). The matrix \(L_A\) is a principal submatrix of \(L(G)\) of size \(N_I \times N_I\), i.e., it is formed by removing matching rows and columns. As a result, \(L_A\) is symmetric.

Consider the matrix \(M = (I - \frac{L_A}{\sigma })\), which is symmetric because \(L_A\) is. Taking \(\sigma = 1 + d_{\mathrm{max}}\) (the value implied by the expressions below), any diagonal element \(M_{ii}\) satisfies

$$\begin{aligned} M_{ii}&= 1 - \frac{d_i}{\sigma } \nonumber \\&= 1 - \frac{d_i}{1+d_{\mathrm{max}}} \end{aligned}$$
(14)
$$\begin{aligned}&> 0 \end{aligned}$$
(15)

The inequality holds because \(d_i \le d_{\mathrm{max}} < 1 + d_{\mathrm{max}}\).

Any off-diagonal element \(M_{ij}\) is

$$\begin{aligned} M_{ij} = {\left\{ \begin{array}{ll} 0, &{} \mathrm{if }\,\{i, j\} \notin E_I\\ \frac{1}{1+d_{\mathrm{max}}}, &{} \mathrm{if }\,\{i, j\} \in E_I \end{array}\right. } \end{aligned}$$

Hence, \(M\) is a nonnegative matrix. Further, \(M = (I - \frac{L_A}{\sigma })\) can be read as the adjacency matrix of a weighted graph \(G_M\) (with self-loops), because its off-diagonal elements are nonzero exactly when the corresponding edge is present in \(G_I\). As \(G_I\) is connected, so is \(G_M\), and therefore \(M\) is irreducible.

Finally, applying the well-known Perron–Frobenius theorem [23] on the nonnegative irreducible matrix \(M = (I - \frac{L_A}{\sigma })\), we get that the first (largest) eigenvalue \(\lambda _1\) and the corresponding eigenvector \(\vec {u}_1\) are all positive and real. \(\square \)
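
The properties used in this proof can be checked numerically on a toy graph. The sketch below is illustrative only: the graph, the infected set, the helper names `build_m` and `power_iteration`, and the choice of \(d_{\mathrm{max}}\) as the maximum degree over the whole graph are all assumptions made for the example, not details taken from the paper.

```python
import math


def build_m(adj, infected):
    """Return M = I - L_A / sigma, assuming sigma = 1 + d_max."""
    nodes = sorted(infected)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}  # degrees in G
    sigma = 1 + max(degree.values())
    m = []
    for u in nodes:
        row = []
        for v in nodes:
            if u == v:
                row.append(1 - degree[u] / sigma)  # diagonal: 1 - d_i / sigma
            elif v in adj[u]:
                row.append(1 / sigma)              # edge inside G_I
            else:
                row.append(0.0)
        m.append(row)
    return m


def power_iteration(m, steps=500):
    """Approximate the largest eigenvalue and its eigenvector."""
    n = len(m)
    x = [1.0 / math.sqrt(n)] * n  # positive unit-length start vector
    lam = 0.0
    for _ in range(steps):
        y = [sum(m[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = sum(x[i] * y[i] for i in range(n))  # Rayleigh quotient
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return lam, x


# Hypothetical 5-node graph; nodes 0..3 are infected and form a connected G_I.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
infected = {0, 1, 2, 3}

M = build_m(adj, infected)
lam, u = power_iteration(M)
print("largest eigenvalue:", lam)
print("eigenvector:", u)
```

On this example every entry of `M` is nonnegative, `M` is symmetric, and the computed leading eigenvalue and eigenvector are strictly positive, as the Perron–Frobenius argument predicts.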

Proof of Lemma 2

First, note that both matrices \(M= (I - \frac{L_A}{\sigma })\) and \(L_A\) are symmetric (see proof of Lemma 1 above). Hence, it follows that all their eigenvalues are real [34].

Now, consider the Laplacian matrix \(L(G)\) of graph \(G\). It is well known that its smallest eigenvalue is 0 [6]. Hence, all its eigenvalues are nonnegative. Let its eigenvalues be

$$\begin{aligned} 0 = \mu _1 \le \mu _2 \le \mu _3 \le \cdots \le \mu _N \end{aligned}$$

Consider any cofactor matrix \(C_{LG}\) of \(L(G)\). Recall that a cofactor matrix is a principal submatrix resulting after the removal of one matching row and column. Clearly, \(C_{LG}\) is also symmetric and has all real eigenvalues. Let them be

$$\begin{aligned} \nu _1 \le \nu _2 \le \nu _3 \le \cdots \le \nu _{N-1} \end{aligned}$$

We can now apply the Cauchy eigenvalue interlacing theorem [34] to \(L(G)\) and \(C_{LG}\). Applying it, we get that

$$\begin{aligned} 0 \le \nu _1 \le \mu _2 \le \nu _2 \le \cdots \le \nu _{N-1} \le \mu _N \end{aligned}$$

Hence, all the eigenvalues of \(C_{LG}\) are nonnegative.

Now, recall that according to the famous Kirchhoff matrix tree theorem [6], the determinant of any cofactor of the Laplacian matrix \(L(G)\) of graph \(G\) is equal to the number of spanning trees of \(G\). As \(G\) is connected, it has at least one spanning tree, so this number cannot be zero. Hence, the determinant of \(C_{LG}\) is strictly positive, i.e.,

$$\begin{aligned} \mathrm{det}(C_{LG}) > 0 \end{aligned}$$

Further, it is well known that the determinant of any square matrix is equal to the product of its eigenvalues [34]. So,

$$\begin{aligned} \mathrm{det}(C_{LG}) = \prod _{i=1}^{N-1} \nu _i > 0 \Rightarrow \, \forall _i~~\nu _i > 0, \end{aligned}$$
(16)

i.e., as the eigenvalues of \(C_{LG}\) are nonnegative and their product is positive, none of them is zero.
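
The matrix tree theorem invoked above is easy to verify on small graphs. The following sketch is an illustration and not part of the paper: the helper names are hypothetical, and the two test graphs (a 4-cycle, which has 4 spanning trees, and the complete graph on 4 nodes, which has 16) are chosen only because their spanning-tree counts are known.

```python
from fractions import Fraction


def laplacian(n, edges):
    """Laplacian L = D - A of an undirected graph on nodes 0..n-1."""
    lap = [[0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    return lap


def remove_row_col(mat, k):
    """Principal submatrix obtained by deleting row k and column k."""
    n = len(mat)
    return [[mat[i][j] for j in range(n) if j != k] for i in range(n) if i != k]


def det(mat):
    """Exact determinant by Gaussian elimination over fractions."""
    a = [[Fraction(x) for x in row] for row in mat]
    n = len(a)
    result = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]  # a row swap flips the sign
            result = -result
        result *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= factor * a[c][j]
    return result


cycle4 = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
k4 = laplacian(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
print([int(det(remove_row_col(cycle4, k))) for k in range(4)])  # [4, 4, 4, 4]
print([int(det(remove_row_col(k4, k))) for k in range(4)])      # [16, 16, 16, 16]
```

The determinant is the same whichever row and column are removed, and it is positive because both graphs are connected.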

Recall that \(L_A\) is a principal submatrix of \(L(G)\). Hence, it is a cofactor of some larger principal submatrix, which is in turn a cofactor of a still larger principal submatrix, and so on until we reach \(C_{LG}\). In other words, there is a sequence of submatrices leading from \(C_{LG}\) to \(L_A\) in which each submatrix is a cofactor of the one before it. By the interlacing theorem, the smallest eigenvalue cannot decrease from one matrix in this sequence to the next. Hence, starting from Eq. 16 and applying interlacing successively along the sequence, we get that all the eigenvalues of \(L_A\) are strictly positive.

Finally, note that every eigenvalue of \((I - \frac{L_A}{\sigma })\) has the form \(1 - \frac{\lambda (L_A)}{\sigma }\), where \(\lambda (L_A)\) is an eigenvalue of \(L_A\). Hence, as \(\sigma > 0\) and all the eigenvalues of \(L_A\) are positive, the largest eigenvalue of \((I - \frac{L_A}{\sigma })\) corresponds to the smallest eigenvalue of \(L_A\):

$$\begin{aligned} \lambda _1 \left( I - \frac{L_A}{\sigma }\right) = 1 - \frac{\lambda _{N_I}(L_A)}{\sigma }, \end{aligned}$$

where \(\lambda _{N_I}(L_A)\) is the smallest eigenvalue of \(L_A\). \(\square \)
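
As a small worked check of this identity (an illustrative example, not taken from the paper), consider the path graph on three nodes \(0 - 1 - 2\) in which nodes 0 and 1 are infected, and assume \(\sigma = 1 + d_{\mathrm{max}} = 3\). Then

$$\begin{aligned} L_A = \begin{pmatrix} 1 & -1 \\ -1 & 2 \end{pmatrix}, \qquad I - \frac{L_A}{\sigma } = \begin{pmatrix} 2/3 & 1/3 \\ 1/3 & 1/3 \end{pmatrix}. \end{aligned}$$

The eigenvalues of \(L_A\) are \(\frac{3 \pm \sqrt{5}}{2}\), both strictly positive, and those of \(I - \frac{L_A}{\sigma }\) are \(\frac{3 \mp \sqrt{5}}{6}\). The largest eigenvalue of the latter therefore comes from the smallest eigenvalue of the former:

$$\begin{aligned} \lambda _1 \left( I - \frac{L_A}{\sigma }\right) = \frac{3 + \sqrt{5}}{6} = 1 - \frac{1}{3} \cdot \frac{3 - \sqrt{5}}{2} \approx 0.873. \end{aligned}$$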

Proof of Lemma 3

We compute \(\mathcal S _k\) for increasing seed set sizes \(k\) until the MDL score tells us to stop. Hence, the running time is \(O(k^* (\mathcal E _{I} + T_{\mathrm{RIPPLE}} + T_{\mathrm{MDL}}))\), where \(k^*\) is the optimal seed set size and \(T_{\mathrm{MDL}}\) is the running time of computing the MDL score for a seed set of size \(k^*\). Here, we used the fact that calculating the eigenvector using the Lanczos method takes approximately \(O(E)\) time, where \(E\) is the number of edges, for sparse graphs.

The worst-case complexity \(T_{\mathrm{MDL}}\) of calculating \(\mathcal{L }(G_I, \mathcal{S }, R)\) for a given \(G_I, \mathcal{S }\), and \(R\) is \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _I)\). The \(\mathcal{L }(\mathcal{S })\) term is \(O(1)\). For the \(\mathcal{L }(R\mid \mathcal{S })\) term, we need to iterate over the ripple, which is at most \(\mathcal V _{I}\) steps long. We only have to update the frontier set \(\mathcal F \) when one or more nodes get infected, in which case we update the attack degrees of the nodes connected to the nodes infected at time \(t\). Hence, we traverse every edge in \(\mathcal E _{I}+\mathcal E _F\) and every node in \(\mathcal V _{I}\), which gives a complexity of \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})\).

Finally, the running time \(T_{\mathrm{RIPPLE}}\) of computation of the MLE ripple for a given \(\mathcal S _k\) is also \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})\).

So, the overall complexity of NetSleuth is \(O(k^* (\mathcal E _I + \mathcal E _F + \mathcal V _I))\). \(\square \)
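
The counting argument behind the \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})\) bound can be illustrated with a short sketch. This is not the authors' implementation: the function `replay_ripple`, the representation of a ripple as a list of node lists, and the toy graph are assumptions made for the example. The sketch replays a given ripple, maintains the attack degree (number of infected neighbours) of every frontier node, and counts how many adjacency entries it touches.

```python
def replay_ripple(adj, ripple):
    """Replay a ripple, tracking frontier attack degrees and work done.

    adj    : dict mapping node -> list of neighbours (undirected graph)
    ripple : list of lists; ripple[t] holds the nodes infected at step t,
             with ripple[0] being the seed set (assumed representation)
    Returns (attack_degree, edge_visits).
    """
    infected = set()
    attack_degree = {}  # frontier node -> number of infected neighbours
    edge_visits = 0
    for newly_infected in ripple:
        for u in newly_infected:
            infected.add(u)
            attack_degree.pop(u, None)  # u is no longer on the frontier
        for u in newly_infected:
            for v in adj[u]:
                edge_visits += 1
                if v not in infected:
                    attack_degree[v] = attack_degree.get(v, 0) + 1
    return attack_degree, edge_visits


# Hypothetical graph: infected edges 0-1, 0-2, 1-2, 1-3; frontier edge 3-4.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
ripple = [[1], [0, 3], [2]]

frontier, visits = replay_ripple(adj, ripple)
print(frontier)  # {4: 1}
print(visits)    # 9
```

Each infected node's adjacency list is scanned exactly once, so every infected edge is visited twice and every frontier edge once. In the example this gives \(2 \cdot 4 + 1 = 9\) visits, which is linear in the number of infected and frontier edges.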

Cite this article

Prakash, B.A., Vreeken, J. & Faloutsos, C. Efficiently spotting the starting points of an epidemic in a large graph. Knowl Inf Syst 38, 35–59 (2014). https://doi.org/10.1007/s10115-013-0671-5

  • DOI: https://doi.org/10.1007/s10115-013-0671-5
