Abstract
Given a snapshot of a large graph in which an infection has been spreading for some time, can we identify the nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? In this paper, we answer this question affirmatively and give an efficient method, called NetSleuth, for the well-known susceptible-infected virus propagation model. Essentially, we are after the set of seed nodes that best explains the given snapshot. We propose to employ the minimum description length principle to identify the best set of seed nodes and virus propagation ripple, as the one by which we can most succinctly describe the infected graph. We give a highly efficient algorithm to identify likely sets of seed nodes given a snapshot. Then, given these seed nodes, we show we can optimize the virus propagation ripple in a principled way by maximizing likelihood. With all three combined, NetSleuth can automatically identify the correct number of seed nodes, as well as which nodes are the culprits. Experiments show that our method detects seed nodes with high accuracy and correctly identifies their number automatically. Moreover, NetSleuth scales linearly in the number of nodes of the graph.






Notes
When not interested in the actual ripple \(R\), one could encode \(G_I\) by its overall probability starting from \(\mathcal{S }\). Obtaining this probability, however, is very expensive, even by MCMC sampling. As we will see in Sects. 5 and 6, computing a good ripple is both cheap and gives good results.
For more information, see http://topology.eecs.umich.edu/data.html.
We use the standard definition of the Jaccard distance between two sets \(\mathcal A \) and \(\mathcal B \), namely \(1 - \frac{|\mathcal A \cap \mathcal B |}{|\mathcal A \cup \mathcal B |}\).
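As a small illustration of the Jaccard distance in the note above, a minimal Python sketch follows. The function name `jaccard_distance` and the convention of returning 0 for two empty sets are assumptions made here for illustration; they are not taken from the paper.

```python
def jaccard_distance(a, b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are at distance 0.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical example: a true seed set versus a recovered seed set.
true_seeds = {1, 2, 3}
found_seeds = {2, 3, 4}
distance = jaccard_distance(true_seeds, found_seeds)
```

Here the two sets share two of four distinct nodes, so the distance is one half; identical sets are at distance 0 and disjoint sets at distance 1.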
References
Anderson RM, May RM (1991) Infectious diseases of humans: dynamics and control. Oxford University Press, Oxford
Bikhchandani S, Hirshleifer D, Welch I (1992) A theory of fads, fashion, custom, and cultural change as informational cascades. J Polit Econ 100(5):992–1026
Briesemeister L, Lincoln P, Porras P (2003) Epidemic profiles and defense of scale-free networks. In: WORM 2003, Washington, DC
Cilibrasi R, Vitányi P (2005) Clustering by compression. IEEE Trans Inf Theory 51(4):1523–1545
Cover TM, Thomas JA (2006) Elements of information theory. Wiley-Interscience, New York, pp 110–112
Cvetković DM, Doob M, Sachs H (1998) Spectra of graphs: theory and applications, 3rd edn
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the 10th ACM international conference on knowledge discovery and data mining (SIGKDD), Seattle, WA, pp 79–88
Chakrabarti D, Wang Y, Wang C, Leskovec J, Faloutsos C (2008) Epidemic thresholds in real networks. TISSEC 10(4)
Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’10). ACM, New York, pp 1029–1038. doi:10.1145/1835804.1835934. http://doi.acm.org/10.1145/1835804.1835934
Faloutsos C, Megalooikonomou V (2007) On data mining, compression and Kolmogorov complexity. In: Webb G (ed) Data mining and knowledge discovery, vol 15. Springer, Berlin, pp 3–20
Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge
Goldenberg J, Libai B, Muller E (2001) Talk of the network: a complex systems look at the underlying process of word-of-mouth. Market Lett 12(3):211–223
Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the 13th international conference on World Wide Web (WWW)
Ganesh A, Massoulié L, Towsley D (2005) The effect of network topology on the spread of epidemics. In: INFOCOM
Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada
Kephart JO, White SR (1993) Measuring and modeling computer virus prevalence. In: SP
Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the 9th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC
Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, Berlin
Leskovec J, Adamic LA, Huberman BA (2006) The dynamics of viral marketing. In: EC
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance NS (2007a) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), San Jose, CA, pp 420–429
Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007b) Cascading behavior in large blog graphs: patterns and a model. In: Proceedings of the 7th SIAM international conference on data mining (SDM), Minneapolis, MN
Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD), Washington, DC, pp 1059–1068
MacCluer CR (2000) The many proofs and applications of Perron’s theorem. SIAM Rev 42:1
Pastor-Satorras R, Vespignani A (2001) Epidemic spreading in scale-free networks. Phys Rev Lett 86:14
Prakash BA, Tong H, Valler N, Faloutsos M, Faloutsos C (2010) Virus propagation on time-varying networks: theory and immunization algorithms. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD), Barcelona, Spain
Prakash BA, Chakrabarti D, Faloutsos M, Valler N, Faloutsos C (2011) Threshold conditions for arbitrary cascade models on arbitrary networks. In: Proceedings of the 11th IEEE international conference on data mining (ICDM), Vancouver, Canada
Prakash BA, Chakrabarti D, Valler N, Faloutsos M, Faloutsos C (2012) Threshold conditions for arbitrary cascade models on arbitrary networks. Knowl Inf Syst 33(3):549–575
Rissanen J (1978) Modeling by shortest data description. Automatica 14(1):465–471
Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431
Richardson M, Domingos P (2002) Mining knowledge-sharing sites for viral marketing. In: Proceedings of the 8th ACM international conference on knowledge discovery and data mining (SIGKDD), Edmonton, Alberta
Roos T, Rissanen J (2008) On sequentially normalized maximum likelihood models. In: Proceedings of the workshop on information theoretic methods in science and engineering (WITMSE)
Saito K, Kimura M, Ohara K, Motoda H (2012) Efficient discovery of influential nodes for SIS models in social networks. Knowl Inf Syst 30(3):613–635
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464
Strang G (1988) Linear algebra and its applications, 3rd edn. Harcourt Brace Jovanovich, San Diego
Shah D, Zaman T (2010) Detecting sources of computer viruses in networks: theory and experiment. In: SIGMETRICS, pp 203–214
Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57(8):5163–5181
Smets K, Vreeken J (2011) The odd one out: identifying and characterising anomalies. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ. Society for Industrial and Applied Mathematics (SIAM), pp 804–815
Tong H, Prakash BA, Tsourakakis CE, Eliassi-Rad T, Faloutsos C, Chau DH (2010) On the vulnerability of large graphs. In: Proceedings of the 10th IEEE international conference on data mining (ICDM), Sydney, Australia
Vereshchagin N, Vitányi P (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1):169–214
Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, Berlin
Zhao J, Wu J, Feng X, Xiong H, Xu K (2011) Information propagation in online social networks: a tie-strength perspective. Knowl Inf Syst. 1–20. doi:10.1007/s10115-011-0445-x
Acknowledgments
This material is based upon work supported by the Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 and the National Science Foundation under Grant No. IIS-1017415. Jilles Vreeken is supported by a Postdoctoral Fellowship of the Research Foundation - Flanders (FWO).
Appendix 1
1.1 Proofs
In this section, we formally prove the lemmas used in the paper.
Proof of Lemma 1
The graph \(G_I\) is connected (we assume that the set of infected nodes is connected; otherwise, we are simply dealing with separate problems, one for each connected component). Now, all off-diagonal entries of the Laplacian matrix \(L(G)\) are nonpositive, and all its diagonal entries are positive. In addition, \(L(G)\) is a symmetric matrix (as \(A(G)\) is symmetric). The matrix \(L_A\) is a principal submatrix (i.e., it has been formed by removing matching rows and columns) of size \(N_I \times N_I\) of \(L(G)\). As a result, \(L_A\) is symmetric.
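Consider matrix \(M = (I - \frac{L_A}{\sigma })\). Clearly, it is symmetric (due to the above). Consider some diagonal element \(M_{ii}\) (for some index \(i\)):
\[ M_{ii} = 1 - \frac{(L_A)_{ii}}{\sigma }, \]
which is nonnegative whenever \(\sigma \) is at least as large as every diagonal entry of \(L_A\).
Any off-diagonal element \(M_{ij}\) is
\[ M_{ij} = -\frac{(L_A)_{ij}}{\sigma } \ge 0, \]
because the off-diagonal entries of \(L_A\) are nonpositive and \(\sigma \) is positive.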
Hence, first, \(M\) is a nonnegative matrix. Further, the structure of the matrix \(M = (I - \frac{L_A}{\sigma })\) represents the adjacency matrix of a weighted connected graph \(G_M\) (with self-loops). This is because its off-diagonal elements are nonzero only when the corresponding edge is present in \(G_I\). Hence, as \(G_I\) is connected, so is \(G_M\). Now, as \(G_M\) is connected, we have that \(M\) is irreducible.
Finally, applying the well-known Perron–Frobenius theorem [23] to the nonnegative irreducible matrix \(M = (I - \frac{L_A}{\sigma })\), we get that the first (largest) eigenvalue \(\lambda _1\) is real and positive, and that all entries of the corresponding eigenvector \(\vec {u}_1\) are real and positive. \(\square \)
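As a numerical sanity check of Lemma 1 (an illustration only, not part of the proof), the following sketch builds \(M\) for a toy graph and inspects its largest eigenpair with NumPy. The path graph, the choice of infected nodes, and taking \(\sigma \) to be the largest diagonal entry of \(L_A\) are all assumptions made for this example.

```python
import numpy as np

# Hypothetical toy graph: a path 0-1-2-3-4 in which nodes 1, 2, 3 are infected.
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0

L = np.diag(A.sum(axis=1)) - A           # graph Laplacian L(G)
infected = [1, 2, 3]
L_A = L[np.ix_(infected, infected)]      # principal submatrix on infected nodes

# Assumption for this sketch: sigma is the largest diagonal entry of L_A.
sigma = L_A.diagonal().max()
M = np.eye(len(infected)) - L_A / sigma

# M is symmetric, so eigh applies; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
lam1 = eigvals[-1]
u1 = eigvecs[:, -1]
u1 = u1 if u1.sum() > 0 else -u1         # eigenvectors are defined up to sign

print("M is nonnegative:", bool((M >= 0).all()))
print("largest eigenvalue:", lam1)
print("eigenvector entries all positive:", bool((u1 > 0).all()))
```

For this small example, \(M\) is nonnegative and its leading eigenvector has strictly positive entries, as the lemma states.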
Proof of Lemma 2
First, note that both matrices \(M= (I - \frac{L_A}{\sigma })\) and \(L_A\) are symmetric (see proof of Lemma 1 above). Hence, it follows that all their eigenvalues are real [34].
Now, consider the Laplacian matrix \(L(G)\) of graph \(G\). It is well known that its smallest eigenvalue is 0 [6]. Hence, all its eigenvalues are nonnegative. Writing \(N\) for the number of nodes of \(G\), let its eigenvalues be
\[ \lambda _1(L(G)) \ge \lambda _2(L(G)) \ge \cdots \ge \lambda _N(L(G)) = 0. \]
Consider any cofactor matrix \(C_{LG}\) of \(L(G)\). Recall that a cofactor matrix is a principal submatrix resulting after the removal of one matching row and column. Clearly, \(C_{LG}\) is also symmetric and has all real eigenvalues. Let them be
\[ \lambda _1(C_{LG}) \ge \lambda _2(C_{LG}) \ge \cdots \ge \lambda _{N-1}(C_{LG}). \]
We can now apply the Cauchy eigenvalue interlacing theorem [34] to \(L(G)\) and \(C_{LG}\). Applying it, we get that
\[ \lambda _1(L(G)) \ge \lambda _1(C_{LG}) \ge \lambda _2(L(G)) \ge \cdots \ge \lambda _{N-1}(C_{LG}) \ge \lambda _N(L(G)) = 0. \]
Hence, all the eigenvalues of \(C_{LG}\) are nonnegative.
Now, recall that according to the famous Kirchhoff matrix tree theorem [6], the determinant of any cofactor of the Laplacian matrix \(L(G)\) of graph \(G\) is equal to the number of spanning trees of \(G\). As this number cannot be zero for a connected graph, the determinant of \(C_{LG}\) is also nonzero, i.e.,
\[ \det (C_{LG}) \ne 0. \]
Further, it is well known that the determinant of any matrix is equal to the product of its eigenvalues [34]. So,
\[ \prod _{i=1}^{N-1} \lambda _i(C_{LG}) = \det (C_{LG}) \ne 0, \]
i.e., none of the eigenvalues of \(C_{LG}\) is zero.
Recall that \(L_A\) is a principal submatrix of \(L(G)\); hence, it is a cofactor of some larger principal submatrix, which is in turn a cofactor of a still larger principal submatrix, and so on until we reach \(C_{LG}\). Hence, there is a sequence of submatrices leading from \(C_{LG}\) to \(L_A\) in which each submatrix is a cofactor of the one before it. Applying the interlacing inequality (Eq. 16) above successively to this sequence, we get that all the eigenvalues of \(L_A\) are strictly positive.
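Finally, note that every eigenvalue of \(M = (I - \frac{L_A}{\sigma })\) has the form \(1 - \frac{\lambda }{\sigma }\), where \(\lambda \) is an eigenvalue of \(L_A\). Hence, as all the eigenvalues of \(L_A\) are strictly positive, we get
\[ \lambda _1 = 1 - \frac{\lambda _{N_I}(L_A)}{\sigma } < 1, \]
where \(\lambda _1\) is the largest eigenvalue of \(M\) and \(\lambda _{N_I}(L_A)\) is the smallest eigenvalue of \(L_A\). \(\square \)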
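The chain of facts used in the proof of Lemma 2 can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the authors' code: the toy edge list, the node removed to form the cofactor, the infected set, and the choice of \(\sigma \) as the largest diagonal entry of \(L_A\) are all hypothetical.

```python
import numpy as np

# Hypothetical connected toy graph: a 4-cycle with one chord and a pendant node.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (3, 4)]
n = 5
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
L = np.diag(A.sum(axis=1)) - A                 # Laplacian L(G)

# Cofactor C_LG: remove the matching row and column of node 4.
keep = [0, 1, 2, 3]
C = L[np.ix_(keep, keep)]

lam_L = np.linalg.eigvalsh(L)[::-1]            # eigenvalues, descending
lam_C = np.linalg.eigvalsh(C)[::-1]

# Cauchy interlacing: lam_L[i] >= lam_C[i] >= lam_L[i + 1] (up to rounding).
tol = 1e-9
interlaced = all(
    lam_L[i] + tol >= lam_C[i] >= lam_L[i + 1] - tol for i in range(n - 1)
)

# Matrix tree theorem: det of a cofactor counts the spanning trees of G.
num_spanning_trees = int(round(np.linalg.det(C)))

# Principal submatrix on a hypothetical infected set, and the matrix M.
infected = [0, 1, 2]
L_A = L[np.ix_(infected, infected)]
lam_LA = np.linalg.eigvalsh(L_A)               # ascending
sigma = L_A.diagonal().max()                   # assumed choice of sigma
M = np.eye(len(infected)) - L_A / sigma
lam1_M = np.linalg.eigvalsh(M)[-1]

print("interlacing holds:", interlaced)
print("spanning trees:", num_spanning_trees)
print("smallest eigenvalue of L_A:", lam_LA[0])
print("largest eigenvalue of M:", lam1_M)
```

On this graph the cofactor determinant equals the spanning-tree count, the smallest eigenvalue of \(L_A\) is strictly positive, and the largest eigenvalue of \(M\) is below 1, matching the steps of the argument.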
Proof of Lemma 3
We keep finding \(\mathcal S _k\) for each seed set size \(k\) until MDL tells us to stop. Hence, the running time is \(O(k^* (\mathcal E _{I} + T_{\mathrm{RIPPLE}} + T_{\mathrm{MDL}}))\), where \(k^*\) is the optimal seed set size and \(T_{\mathrm{MDL}}\) is the running time of computing the MDL score for a seed set of size \(k^*\). Here, we used the fact that calculating the eigenvector using the Lanczos method takes approximately \(O(E)\) time for sparse graphs, where \(E\) is the number of edges.
The worst-case complexity \(T_{\mathrm{MDL}}\) of calculating \(\mathcal{L }(G_I, \mathcal{S }, R)\) for a given \(G_I, \mathcal{S }\), and \(R\) is \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _I)\). The \(\mathcal{L }(\mathcal{S })\) term is \(O(1)\). For the \(\mathcal{L }(R\mid \mathcal{S })\) term, we need to iterate over the ripple, which is at most \(\mathcal V _{I}\) steps long. We only have to update the frontier set \(\mathcal F \) when one or more nodes got infected, for which we then have to update the attack degrees of the nodes connected to the nodes infected at time \(t\). Hence, we traverse every edge in \(\mathcal E _{I}+\mathcal E _F\) and every node in \(\mathcal V _{I}\), which gives it the complexity of \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})\).
Finally, the running time \(T_{\mathrm{RIPPLE}}\) of computation of the MLE ripple for a given \(\mathcal S _k\) is also \(O(\mathcal E _I + \mathcal E _{F} + \mathcal V _{I})\).
So, the overall complexity of NetSleuth is \(O(k^* (\mathcal E _I + \mathcal E _F + \mathcal V _I))\). \(\square \)
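To make the edge-counting argument behind \(T_{\mathrm{MDL}}\) concrete, the following sketch replays a ripple and maintains frontier attack degrees incrementally. It is not the authors' implementation: the function name, the data layout, and reading "attack degree" as the number of infected neighbours of a frontier node are assumptions made for illustration.

```python
from collections import defaultdict


def attack_degrees_over_ripple(adj, ripple):
    """Replay a ripple, updating frontier attack degrees incrementally.

    adj    : dict mapping each node to a list of its neighbours (undirected)
    ripple : list of sets; ripple[t] holds the nodes infected at step t
    Returns the final attack degrees of frontier nodes and the number of
    edge visits performed.
    """
    infected = set()
    attack = defaultdict(int)    # frontier node -> number of infected neighbours
    edge_visits = 0
    for newly in ripple:
        infected |= newly
        for u in newly:
            attack.pop(u, None)  # a newly infected node leaves the frontier
        for u in newly:
            for v in adj[u]:     # each infected node's edges are scanned once
                edge_visits += 1
                if v not in infected:
                    attack[v] += 1
    return dict(attack), edge_visits


# Hypothetical example: path 0-1-2-3-4, seed {2}, then {1, 3} infected.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ripple = [{2}, {1, 3}]
frontier_attack, visits = attack_degrees_over_ripple(adj, ripple)
```

Each infected node's adjacency list is scanned exactly once, so the number of edge visits is twice the number of edges among infected nodes plus the number of edges to the frontier. In the example, that is 2 · 2 + 2 = 6 visits, and the frontier nodes 0 and 4 each end with attack degree 1.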
Cite this article
Prakash, B.A., Vreeken, J. & Faloutsos, C. Efficiently spotting the starting points of an epidemic in a large graph. Knowl Inf Syst 38, 35–59 (2014). https://doi.org/10.1007/s10115-013-0671-5