Efficiently spotting the starting points of an epidemic in a large graph

Prakash, B. Aditya; Vreeken, Jilles; Faloutsos, Christos

doi:10.1007/s10115-013-0671-5

Regular paper
Published: 17 July 2013

Volume 38, pages 35–59, (2014)
Knowledge and Information Systems

Abstract

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method called NetSleuth for the well-known susceptible-infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments with our method show high accuracy in the detection of seed nodes, in addition to the correct automatic identification of their number. Moreover, NetSleuth scales linearly in the number of nodes of the graph.

Notes

When not interested in the actual ripple $R$, one could encode $G_I$ by its overall probability starting from $\mathcal{S }$. Obtaining this probability, however, is very expensive, even by MCMC sampling. As we will see in Sects. 5 and 6, computing a good ripple is both cheap and gives good results.
For more information, see http://topology.eecs.umich.edu/data.html.
We use the standard definition of Jaccard distance between two sets $\mathcal{A}$ and $\mathcal{B}$, namely $1 - \frac{|\mathcal{A} \cap \mathcal{B}|}{|\mathcal{A} \cup \mathcal{B}|}$.
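
The Jaccard distance in the last note can be computed directly on sets. The following is a minimal sketch in Python; the node identifiers, the function name `jaccard_distance`, and the convention that two empty sets have distance 0 are assumptions made for illustration and are not taken from the paper.

```python
def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: true seed nodes versus recovered seed nodes.
true_seeds = {1, 2, 3}
found_seeds = {2, 3, 4}
print(jaccard_distance(true_seeds, found_seeds))  # 0.5
```

A distance of 0 means the two seed sets coincide, and a distance of 1 means they are disjoint.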

References

Anderson RM, May RM (1991) Infectious diseases of humans: dynamics and control. Oxford University Press, Oxford
Bikhchandani S, Hirshleifer D, Welch I (1992) A theory of fads, fashion, custom, and cultural change as informational cascades. J Polit Econ 100(5):992–1026
Briesemeister L, Lincoln P, Porras P (2003) Epidemic profiles and defense of scale-free networks. In: WORM 2003, Washington, DC
Cilibrasi R, Vitányi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, New York, pp 110–112
Cvetković DM, Doob M, Sachs H (1998) Spectra of graphs: theory and applications, 3rd edn
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 79–88
Chakrabarti D, Wang Y, Wang C, Leskovec J, Faloutsos C (2008) Epidemic thresholds in real networks. TISSEC 10(4)
Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’10). ACM, New York, pp 1029–1038. doi:10.1145/1835804.1835934. http://doi.acm.org/10.1145/1835804.1835934
Faloutsos C, Megalooikonomou V (2007) On data mining, compression and Kolmogorov complexity. In: Webb G (ed) Data mining and knowledge discovery, vol 15. Springer, Berlin, pp 3–20
Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
Goldenberg J, Libai B, Muller E (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Market Lett 12(3):211–223
Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the 13th international conference on World Wide Web (WWW)
Ganesh A, Massoulié L, Towsley D (2005) The effect of network topology on the spread of epidemics. In: INFOCOM
Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada
Kephart JO, White SR (1993) Measuring and modeling computer virus prevalence. In: SP
Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the 9th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC
Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, Berlin
Leskovec J, Adamic LA, Huberman BA (2006) The dynamics of viral marketing. In: EC
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance NS (2007a) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 420–429
Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007b) Cascading behavior in large blog graphs: patterns and a model. In: Proceedings of the 7th SIAM international conference on data mining (SDM), Minneapolis, MN
Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC, pp 1059–1068
MacCluer CR (2000) The many proofs and applications of Perron’s theorem. SIAM Rev 42:1
Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86:14
Prakash BA, Tong H, Valler N, Faloutsos M, Faloutsos C (2010) Virus propagation on time-varying networks: theory and immunization algorithms. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain
Prakash BA, Chakrabarti D, Faloutsos M, Valler N, Faloutsos C (2011) Threshold conditions for arbitrary cascade models on arbitrary networks. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada
Prakash BA, Chakrabarti D, Valler N, Faloutsos M, Faloutsos C (2012) Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl Inf Syst 33(3):549–575
Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471
Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431
Richardson M, Domingos P (2002) Mining knowledge-sharing sites for viral marketing. In: Proceedings of the 8th ACM international conference on knowledge discovery and data mining (SIGKDD), Edmonton, Alberta
Roos T, Rissanen J (2008) On sequentially normalized maximum likelihood models. In: Proceedings of the workshop on information theoretic methods in science and engineering (WITMSE)
Saito K, Kimura M, Ohara K, Motoda H (2012) Efficient discovery of influential nodes for SIS models in social networks. Knowl Inf Syst 30(3):613–635
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Strang G (1988) Linear algebra and its applications, 3rd edn. Harcourt Brace Jovanovich, San Diego
Shah D, Zaman T (2010) Detecting sources of computer viruses in networks: theory and experiment. In: SIGMETRICS, pp 203–214
Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57(8):5163–5181
Smets K, Vreeken J (2011) The odd one out: Identifying and characterising anomalies. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ, society for industrial and applied mathematics (SIAM), pp 804–815
Tong H, Prakash BA, Tsourakakis CE, Eliassi-Rad T, Faloutsos C, Chau DH (2010) On the vulnerability of large graphs. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), Sydney, Australia
Vereshchagin N, Vitányi P (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, Berlin
Zhao J, Wu J, Feng X, Xiong H, Xu K (2011) Information propagation in online social networks: a tie-strength perspective. Knowl Inf Syst. 1–20. doi:10.1007/s10115-011-0445-x

Acknowledgments

This material is based upon work supported by the Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and the National Science Foundation under Grant No. IIS-1017415. Jilles Vreeken is supported by a Postdoctoral Fellowship of the Research Foundation – Flanders (FWO).

Author information

Authors and Affiliations

Department of Computer Science, Virginia Tech., 2202 Kraft Drive, Blacksburg, VA, 24060, USA
B. Aditya Prakash
Advanced Database Research and Modeling, University of Antwerp, Antwerp, Belgium
Jilles Vreeken
Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Christos Faloutsos

Corresponding author

Correspondence to B. Aditya Prakash.

Appendix 1

1.1 Proofs

In this section, we formally prove the lemmas used in the paper.

Proof of Lemma 1

The graph $G_I$ is connected (we assume the set of infected nodes is connected; otherwise, we are dealing with separate problems, one for each connected component). Now, the Laplacian matrix $L(G)$ has nonpositive off-diagonal entries and positive diagonal entries. In addition, $L(G)$ is symmetric (as $A(G)$ is symmetric). The matrix $L_A$ is a principal submatrix (i.e., it is formed by removing matching rows and columns) of $L(G)$ of size $N_I \times N_I$. As a result, $L_A$ is symmetric.

Consider the matrix $M = (I - \frac{L_A}{\sigma })$ with $\sigma = 1 + d_{\mathrm{max}}$. Clearly, it is symmetric (as $L_A$ is symmetric). Consider some diagonal element $M_{ii}$ (for some index $i$), where $d_i$ is the degree of node $i$ in $G$:

$$\begin{aligned} M_{ii}&= 1 - \frac{d_i}{\sigma } \nonumber \\&= 1 - \frac{d_i}{1+d_{\mathrm{max}}} \end{aligned}$$

(14)

$$\begin{aligned}&> 0 \end{aligned}$$

(15)

where the inequality holds because $d_i \le d_{\mathrm{max}}$.

Any off-diagonal element $M_{ij}$ is

$$\begin{aligned} M_{ij} = {\left\{ \begin{array}{ll} 0, &{} \mathrm{if }\,\{i, j\} \notin E_I\\ \frac{1}{1+d_{\mathrm{max}}}, &{} \mathrm{if }\,\{i, j\} \in E_I \end{array}\right. } \end{aligned}$$

Hence, first, $M$ is a nonnegative matrix. Further, the structure of the matrix $M = (I - \frac{L_A}{\sigma })$ represents the adjacency matrix of a weighted connected graph $G_M$ (with self-loops). This is because its off-diagonal elements are nonzero only when the corresponding edge is present in $G_I$. Hence, as $G_I$ is connected, so is $G_M$. Now, as $G_M$ is connected, we have that $M$ is irreducible.

Finally, applying the well-known Perron–Frobenius theorem [23] on the nonnegative irreducible matrix $M = (I - \frac{L_A}{\sigma })$, we get that the first (largest) eigenvalue $\lambda _1$ and the corresponding eigenvector $\vec {u}_1$ are all positive and real. $\square $
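
As a numerical illustration of Lemma 1, the sketch below builds $M = I - L_A/\sigma$ for a small example and approximates its leading eigenpair by power iteration. The 5-node graph, the infected set, and the helper names are hypothetical; taking $\sigma = 1 + d_{\mathrm{max}}$ with $d_{\mathrm{max}}$ the maximum degree in $G$ is an assumption read off from the entries of $M$ in the proof. Power iteration is used only for illustration and is not claimed to be the method used in the paper.

```python
# Hypothetical 5-node undirected graph G, given as adjacency sets.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
infected = {1, 2, 3}  # nodes of the connected infected subgraph G_I


def laplacian_submatrix(adj, nodes):
    """Rows and columns of L(G) = D - A restricted to the given nodes."""
    idx = sorted(nodes)
    return [[float(len(adj[u])) if u == v else (-1.0 if v in adj[u] else 0.0)
             for v in idx] for u in idx]


def power_iteration(m, steps=500):
    """Approximate the dominant eigenpair of a nonnegative matrix m."""
    n = len(m)
    vec = [1.0 / n] * n  # positive start vector whose entries sum to 1
    val = 0.0
    for _ in range(steps):
        nxt = [sum(m[i][j] * vec[j] for j in range(n)) for i in range(n)]
        val = sum(nxt)  # vec sums to 1, so this tends to the eigenvalue
        vec = [x / val for x in nxt]
    return val, vec


L_A = laplacian_submatrix(adj, infected)
d_max = max(len(nbrs) for nbrs in adj.values())
sigma = 1.0 + d_max  # assumption: sigma = 1 + d_max
n = len(L_A)
M = [[(1.0 if i == j else 0.0) - L_A[i][j] / sigma for j in range(n)]
     for i in range(n)]

lam, u = power_iteration(M)
print("largest eigenvalue:", lam)
print("eigenvector:", u)
```

On this example every entry of $M$ is nonnegative, and both the computed eigenvalue and all components of the eigenvector are positive, as the lemma states.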

Proof of Lemma 2

First, note that both matrices $M= (I - \frac{L_A}{\sigma })$ and $L_A$ are symmetric (see proof of Lemma 1 above). Hence, it follows that all their eigenvalues are real [34].

Now, consider the Laplacian matrix $L(G)$ of graph $G$. It is well known that its smallest eigenvalue is 0 [6]. Hence, all its eigenvalues are nonnegative. Let its eigenvalues be

$$\begin{aligned} 0 = \mu _1 \le \mu _2 \le \mu _3 \le \cdots \le \mu _N \end{aligned}$$

Consider any cofactor matrix $C_{LG}$ of $L(G)$. Recall that a cofactor matrix is a principal submatrix resulting after the removal of one matching row and column. Clearly, $C_{LG}$ is also symmetric and has all real eigenvalues. Let them be

$$\begin{aligned} \nu _1 \le \nu _2 \le \nu _3 \le \cdots \le \nu _{N-1} \end{aligned}$$

We can now apply the Cauchy eigenvalue interlacing theorem [34] to $L(G)$ and $C_{LG}$. Applying it, we get that

$$\begin{aligned} 0 \le \nu _1 \le \mu _2 \le \nu _2 \le \cdots \le \nu _{N-1} \le \mu _N \end{aligned}$$

Hence, all the eigenvalues of $C_{LG}$ are nonnegative.

Now, recall that according to the famous Kirchhoff matrix tree theorem [6], the determinant of any cofactor of the Laplacian matrix $L(G)$ of graph $G$ is equal to the number of spanning trees of $G$. As this number cannot be zero, the determinant of $C_{LG}$ is also nonzero, i.e.,

$$\begin{aligned} \mathrm{det}(C_{LG}) > 0 \end{aligned}$$

Further, it is well known that the determinant of any matrix is equal to the product of its eigenvalues [34]. So,

$$\begin{aligned} \mathrm{det}(C_{LG}) = \prod _{i=1}^{N-1} \nu _i > 0 \Rightarrow \, \forall _i~~\nu _i > 0, \end{aligned}$$

(16)

i.e., none of the eigenvalues of $C_{LG}$ are zero.
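
The matrix tree theorem used above can be checked on a small case. The sketch below computes the exact determinant of every cofactor matrix of $L(G)$ for a hypothetical 5-node graph (a triangle with two pendant edges), which has exactly 3 spanning trees. The graph and the helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Hypothetical graph: triangle 1-2-3 with pendant edges 0-1 and 3-4.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}


def laplacian(adj):
    """Laplacian L(G) = D - A as a list of integer rows."""
    nodes = sorted(adj)
    return [[len(adj[u]) if u == v else (-1 if v in adj[u] else 0)
             for v in nodes] for u in nodes]


def cofactor_matrix(matrix, k):
    """Principal submatrix obtained by deleting row k and column k."""
    return [[x for j, x in enumerate(row) if j != k]
            for i, row in enumerate(matrix) if i != k]


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


L = laplacian(adj)
dets = [determinant(cofactor_matrix(L, k)) for k in range(len(L))]
print(dets)  # every entry equals 3, the number of spanning trees
```

Each spanning tree keeps both pendant edges and drops one of the three triangle edges, which gives the count of 3.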

Recall that $L_A$ is a principal submatrix of $L(G)$. Hence, it is a cofactor of some larger principal submatrix, which is in turn a cofactor of a still larger principal submatrix, and so on until we reach $C_{LG}$. In other words, there is a sequence of submatrices leading from $C_{LG}$ to $L_A$ in which each submatrix is a cofactor of the one before it. By the interlacing theorem, the smallest eigenvalue of each submatrix in this sequence is at least the smallest eigenvalue of its predecessor. Hence, starting from Eq. 16 and applying this argument successively, we get that all the eigenvalues of $L_A$ are strictly positive.

Finally, note that any eigenvalue of $(I - \frac{L_A}{\sigma })$ satisfies $\mathrm{eig}(I - \frac{L_A}{\sigma }) = 1 - \mathrm{eig}(\frac{L_A}{\sigma })$. Hence, as all the eigenvalues of $L_A$ are positive, the largest eigenvalue of $(I - \frac{L_A}{\sigma })$ corresponds to the smallest eigenvalue of $L_A$, and we get

$$\begin{aligned} \lambda _1 \left( I - \frac{L_A}{\sigma }\right) = 1 - \frac{\lambda _{N_I}(L_A)}{\sigma }, \end{aligned}$$

where $\lambda _{N_I}(L_A)$ is the smallest eigenvalue of $L_A$. $\square $
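
The relation in Lemma 2 can be checked numerically by computing the spectra of $L_A$ and $M$ independently. The sketch below uses a plain cyclic Jacobi eigenvalue routine, chosen here only so that the example needs no external libraries; it is not the Lanczos method mentioned in the paper. The matrix $L_A$ is the hypothetical 3-node example (infected nodes of a 5-node graph with $d_{\mathrm{max}} = 3$), and $\sigma = 1 + d_{\mathrm{max}} = 4$ is an assumption read off from the proof of Lemma 1.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-24):
    """Sorted eigenvalues of a small symmetric matrix (cyclic Jacobi)."""
    a = [[float(x) for x in row] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                if diff == 0.0:
                    theta = math.copysign(math.pi / 4, a[p][q])
                else:
                    theta = 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))


# Hypothetical L_A: Laplacian rows/columns of three infected nodes.
L_A = [[3, -1, -1], [-1, 2, -1], [-1, -1, 3]]
sigma = 4.0  # assumption: sigma = 1 + d_max with d_max = 3
n = len(L_A)
M = [[(1.0 if i == j else 0.0) - L_A[i][j] / sigma for j in range(n)]
     for i in range(n)]

eig_L = jacobi_eigenvalues(L_A)
eig_M = jacobi_eigenvalues(M)
print("eigenvalues of L_A:", eig_L)
print("largest eigenvalue of M:", eig_M[-1])
print("1 - smallest eigenvalue of L_A / sigma:", 1.0 - eig_L[0] / sigma)
```

On this example all eigenvalues of $L_A$ are strictly positive, and the last two printed values agree up to floating-point error.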

Proof of Lemma 3

We keep finding $\mathcal S _k$ for each seed set size until MDL tells us to stop. Hence, the running time is $O(k^* (\mathcal E _{I} + T_{\mathrm{RIPPLE}} + T_{\mathrm{MDL}}))$ if $k^*$ is the optimal seed set size and $T_{\mathrm{MDL}}$ is the running time of computing the MDL score given the seed set size is $k^*$. Here, we used the fact that calculating the eigenvector using the Lanczos method is approximately $O(E)$ (# edges) for sparse graphs.

The worst-case complexity $T_{\mathrm{MDL}}$ of calculating $\mathcal{L }(G_I, \mathcal{S }, R)$ for a given $G_I, \mathcal{S }$, and $R$ is $O(\mathcal E _I + \mathcal E _{F} + \mathcal V _I)$. The $\mathcal{L }(\mathcal{S })$ term is $O(1)$. For the $\mathcal{L }(R\mid \mathcal{S })$ term, we need to iterate over the ripple, which is at most $\mathcal V _{I}$ steps long. We only have to update the frontier set $\mathcal F $ when one or more nodes got infected, for which we then have to update the attack degrees of the nodes connected to the nodes infected at time $t$. Hence, we traverse every edge in $\mathcal E _{I}+\mathcal E _F$ and every node in $\mathcal V _{I}$, which gives it the complexity of $O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})$.

Finally, the running time $T_{\mathrm{RIPPLE}}$ of computation of the MLE ripple for a given $\mathcal S _k$ is also $O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})$.

So, the overall complexity of NetSleuth is $O(k^* (\mathcal E _I + \mathcal E _F + \mathcal V _I))$. $\square $
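
The linear cost of iterating over a ripple can be illustrated by the bookkeeping alone. The sketch below replays a given ripple, maintains the attack degree of every frontier node, and counts how many edge endpoints are touched. It is not the paper's scoring procedure: the graph, the function name `replay_ripple`, and the definitions used here (frontier nodes are uninfected nodes with at least one infected neighbour, and the attack degree is the number of infected neighbours) are assumptions made for illustration.

```python
def replay_ripple(adj, seeds, ripple):
    """Replay a ripple and count the work done on frontier bookkeeping.

    adj    : dict mapping node -> set of neighbours (undirected graph)
    seeds  : set of nodes infected at time 0
    ripple : list of sets; ripple[t] holds the nodes infected at step t + 1
    """
    infected = set(seeds)
    attack_degree = {}  # frontier node -> number of infected neighbours
    edge_visits = 0

    def absorb(new_nodes):
        nonlocal edge_visits
        for u in new_nodes:
            attack_degree.pop(u, None)  # u leaves the frontier
        for u in new_nodes:
            for v in adj[u]:
                edge_visits += 1
                if v not in infected:
                    attack_degree[v] = attack_degree.get(v, 0) + 1

    absorb(seeds)
    for new_nodes in ripple:
        # Only frontier nodes can become infected in a valid ripple.
        assert all(u in attack_degree for u in new_nodes)
        infected |= new_nodes
        absorb(new_nodes)
    return infected, attack_degree, edge_visits


# Hypothetical example: node 1 is the seed, nodes 2 and 3 follow.
adj = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
infected, frontier, visits = replay_ripple(adj, {1}, [{2, 3}])
print(infected)  # {1, 2, 3}
print(frontier)  # {0: 1, 4: 1}
print(visits)    # 8
```

Each infected node scans its neighbour list exactly once, so the count is twice the number of edges inside the infected subgraph plus the number of frontier edges (here $2 \cdot 3 + 2 = 8$), in line with the $O(\mathcal E _I + \mathcal E _F + \mathcal V _I)$ bound.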

About this article

Cite this article

Prakash, B.A., Vreeken, J. & Faloutsos, C. Efficiently spotting the starting points of an epidemic in a large graph. Knowl Inf Syst 38, 35–59 (2014). https://doi.org/10.1007/s10115-013-0671-5

Received: 16 February 2013
Revised: 04 May 2013
Accepted: 25 May 2013
Published: 17 July 2013
Issue Date: January 2014
DOI: https://doi.org/10.1007/s10115-013-0671-5
