Abstract
This paper studies the problem of identifying a single contagion source when partial timestamps of a contagion process are available. We formulate the source localization problem as a ranking problem on graphs, where infected nodes are ranked according to their likelihood of being the source. Two ranking algorithms, cost-based ranking and tree-based ranking, are proposed in this paper. Experimental evaluations with synthetic and real-world data show that our algorithms significantly improve the ranking accuracy compared with four existing algorithms.
References
Agaskar A, Lu YM (2013) A fast Monte Carlo algorithm for source localization on graphs. In: SPIE optical engineering and applications
Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York
Boyd S, Cortes C, Mohri M, Radovanovic A (2012) Accuracy at the top. In: Advances in neural information processing systems, pp 962–970
Breiger RL, Pattison PE (1986) Cumulated social roles: the duality of persons and their algebras. Soc Netw 8(3):215–256
Char J (1968) Generation of trees, two-trees, and storage of master forests. IEEE Trans Circuit Theory 15(3):228–238
Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 199–208
Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1029–1038
Chen Z, Zhu K, Ying L (2014) Detecting multiple information sources in networks under the SIR model. In: Proceedings of the IEEE conference on information sciences and systems (CISS), Princeton
Dong W, Zhang W, Tan CW (2013) Rooting out the rumor culprit from suspects. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul, pp 2671–2675
Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. Macmillan Higher Education, New York
Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, DC, pp 211–220
Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the international conference on World Wide Web (WWW), New York, pp 491–501
Gundecha P, Feng Z, Liu H (2013) Seeking provenance of information using social media. In: Proceedings of the ACM international conference on information knowledge management (CIKM), San Francisco, pp 1691–1696
Karamchandani N, Franceschetti M (2013) Rumor source detection under probabilistic sampling. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul
Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Washington DC, pp 137–146
Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1059–1068
Lokhov AY, Mezard M, Ohta H, Zdeborova L (2013) Inferring the origin of an epidemic with a dynamic message-passing algorithm. arXiv:1303.5315, preprint
Luo W, Tay WP (2013) Finding an infection source under the SIS model. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Vancouver
Luo W, Tay WP, Leng M (2013) Identifying infection sources and regions in large networks. IEEE Trans Signal Process 61:2850–2865
Luo W, Tay WP, Leng M (2014) How to identify an infection source with limited observations. IEEE J Sel Top Signal Process 8(4):586–597
Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 6–14
Myers SA, Zhu C, Leskovec J (2012) Information diffusion and external influence in networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 33–41
Nguyen DT, Nguyen NP, Thai MT (2012) Sources of misinformation in online social networks: who to suspect? In: IEEE military communications conference (MILCOM 2012), pp 1–6
Pinto PC, Thiran P, Vetterli M (2012) Locating the source of diffusion in large-scale networks. Phys Rev Lett 109(6):068702
Prakash BA, Vreeken J, Faloutsos C (2012) Spotting culprits in epidemics: how many and which ones? In: IEEE international conference on Data Mining (ICDM), Brussels, pp 11–20
Sadikov E, Medina M, Leskovec J, Garcia-Molina H (2011) Correcting for missing data in information cascades. In: Proceedings of the fourth ACM international conference on web search and data mining, pp 55–64
Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57:5163–5181
Shah D, Zaman T (2012) Rumor centrality: a universal source detector. ACM SIGMETRICS Perform Eval Rev 40(1):199–210
Snow J (1854) The cholera near Golden-square, and at Deptford. Med Times Gaz 9:321–322
Wang Z, Dong W, Zhang W, Tan CW (2014) Rumor source detection with multiple observations: fundamental limits and algorithms. In: Proceedings of the annual ACM SIGMETRICS conference, Austin
Zejnilovic S, Gomes J, Sinopoli B (2013) Network observability and localization of the source of diffusion based on a subset of nodes. In: Proceedings of the annual Allerton conference on communication, control and computing, Monticello
Zhu K, Ying L (2013) Information source detection in the SIR model: a sample path based approach. In: Proceedings of information theory and applications workshop (ITA)
Zhu K, Ying L (2014) A robust information source estimator with sparse observations. In: Proceedings of the IEEE international conference on computer communications (INFOCOM), Toronto
Acknowledgments
This work was supported in part by the U.S. Army Research Laboratory’s Army Research Office (ARO Grant No. W911NF1310279).
Additional information
Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, and Concha Bielza.
Appendices
Appendix 1: Proof of Lemma 1
Define \(x_{k,k-1}=t_k-t_{k-1},\) so the cost C can be written as

$$C(\mathbf{x})=\sum _{k=2}^{n}(x_{k,k-1}-\mu )^2.$$

The cost minimization problem can be written as

$$\min _{\mathbf{x}}\ C(\mathbf{x})\quad \text {subject to}\quad \sum _{k=2}^{n}x_{k,k-1}=\tau _n-\tau _1,\quad x_{k,k-1}\ge 0.$$

Note that \(C(\mathbf{x})\) is a convex function of \(\mathbf{x}.\) By verifying the KKT conditions (Boyd and Vandenberghe 2004), it can be shown that the optimal solution to the problem above is \(x_{k,k-1}=\frac{\tau _n-\tau _1}{n-1},\) which implies \(t_k=\tau _1+(k-1)\frac{\tau _n-\tau _1}{n-1}.\)
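As a sanity check, the closed-form solution in Lemma 1 can be verified numerically. The sketch below assumes the quadratic cost \(C(\mathbf{x})=\sum _{k}(x_{k,k-1}-\mu )^2\) with the gaps constrained to be nonnegative and to sum to \(\tau _n-\tau _1\); the specific values of \(n,\) \(\mu ,\) \(\tau _1\) and \(\tau _n\) are illustrative.

```python
import numpy as np

def cost(x, mu):
    # quadratic deviation of the inter-infection gaps from the mean delay mu
    return float(np.sum((x - mu) ** 2))

rng = np.random.default_rng(0)
n, mu, tau_1, tau_n = 6, 1.0, 0.0, 12.0

# closed-form optimum from Lemma 1: spread the n-1 gaps equally
x_star = np.full(n - 1, (tau_n - tau_1) / (n - 1))

# compare against random feasible gap vectors (nonnegative, correct sum)
for _ in range(1000):
    x = rng.dirichlet(np.ones(n - 1)) * (tau_n - tau_1)
    assert cost(x_star, mu) <= cost(x, mu) + 1e-9
```

Since the objective is convex and symmetric in the gaps, any feasible perturbation away from equal spacing can only increase the cost.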
Appendix 2: Proof of Theorem 1
Assume all nodes in the network are infected and the infection times of two nodes (say Node v and Node w) are observed. Without loss of generality, assume \(\tau _v<\tau _w.\) Furthermore, assume the graph is undirected (i.e., all edges are bidirectional) and

$$\tau _w-\tau _v>\mu (|{\mathcal {I}}|-1).$$
We will prove the theorem by showing that computing the cost of Node v is related to the longest path problem between Nodes v and w.
To compute C(v), we consider those spreading trees rooted at Node v. Given a spreading tree \({\mathcal {P}}=({\mathcal {T}}, \mathbf{t})\) rooted at Node v, denote by \({\mathcal {Q}}(v,w)\) the set of edges on the path from Node v to Node w. The cost of the spreading tree can be written as

$$C({\mathcal {P}})=\sum _{(h,u)\in {\mathcal {Q}}(v,w)}(\tau _u-\tau _h-\mu )^2+\sum _{(h,u)\in {\mathcal {E}}({\mathcal {T}})\backslash {\mathcal {Q}}(v,w)}(\tau _u-\tau _h-\mu )^2.$$
Recall that only the infection times of Nodes v and w are known. Furthermore, Nodes v and w will not both appear on a path in \({\mathcal {T}}\backslash {\mathcal {Q}}(v,w).\) Therefore, by choosing \(\tau _u-\tau _h=\mu \) for each \((h,u)\in {\mathcal {E}}({\mathcal {T}})\backslash {\mathcal {Q}}(v,w),\) we have

$$\min _{\mathbf{t}}C({\mathcal {P}})=\min _{\mathbf{t}}\sum _{(h,u)\in {\mathcal {Q}}(v,w)}(\tau _u-\tau _h-\mu )^2.$$

Next, applying Lemma 1, we obtain

$$\min _{\mathbf{t}}C({\mathcal {P}})\ge \frac{\left(\tau _w-\tau _v-\mu |{\mathcal {Q}}(v,w)|\right)^2}{|{\mathcal {Q}}(v,w)|},\qquad (12)$$

where the equality is achieved by assigning the timestamps according to Lemma 1.
For fixed \(|\tau _w-\tau _v|\) and \(\mu ,\) writing \(q=|{\mathcal {Q}}(v,w)|,\) we have

$$\frac{\partial }{\partial q}\,\frac{(\tau _w-\tau _v-\mu q)^2}{q}=\frac{\mu ^2q^2-(\tau _w-\tau _v)^2}{q^2}\overset{(b)}{\le }\frac{\mu ^2(|{\mathcal {I}}|-1)^2-(\tau _w-\tau _v)^2}{q^2}\overset{(a)}{<}0,$$

where inequality (a) holds because of the assumption \(\tau _w-\tau _v>\mu (|{\mathcal {I}}|-1)\) and inequality (b) is due to \(|{\mathcal {Q}}(v,w)|\le |{\mathcal {I}}|-1.\) So (12) is a decreasing function of \(|{\mathcal {Q}}(v,w)|\) (the length of the path).
Let \(\eta \) denote the length of the longest path between v and w. Given the longest path, we can construct a spreading tree \({\mathcal {P}}^*\) by generating \({\mathcal {T}}^*\) using breadth-first search starting from the longest path and assigning the timestamps \(\mathbf{t}^*\) as mentioned above. Then,

$$C(v)=C({\mathcal {P}}^*)=\frac{(\tau _w-\tau _v-\mu \eta )^2}{\eta }.$$
Therefore, the algorithm that computes C(v) can be used to find the longest path between Nodes v and w. Since the longest path problem is NP-hard (Garey and Johnson 1979), the calculation of C(v) must also be NP-hard.
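The monotonicity step in the proof can also be checked numerically. The sketch below assumes the path cost \((\tau _w-\tau _v-\mu q)^2/q\) obtained via Lemma 1 for a path of length \(q\); the values of \(\mu \) and \(|{\mathcal {I}}|\) are illustrative and chosen so that \(\tau _w-\tau _v>\mu (|{\mathcal {I}}|-1).\)

```python
def path_cost(delta, mu, q):
    # minimum cost of a length-q path whose endpoint timestamps differ by
    # delta, after Lemma 1 spreads the gap evenly over the q edges
    return (delta - mu * q) ** 2 / q

mu, n_infected = 1.0, 10             # illustrative values
delta = mu * (n_infected - 1) + 5.0  # satisfies delta > mu * (|I| - 1)

costs = [path_cost(delta, mu, q) for q in range(1, n_infected)]
# the cost strictly decreases as the candidate path gets longer,
# so minimizing the cost forces the longest v-w path
assert all(a > b for a, b in zip(costs, costs[1:]))
```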
Appendix 3: Proof of Theorem 2
Note that the complexity of the modified breadth-first search is \(O(|{\mathcal {E}}_I|),\) since each edge in the subgraph formed by the infected nodes needs to be considered only once. We next analyze the complexity of EIF:
- Step 1 The complexity of computing the paths from an infected node to all other infected nodes is \(O(|{\mathcal {E}}_I|).\) Given \(|\alpha |\) infected nodes with timestamps, the computational complexity of Step 1 is \(O(|\alpha ||{\mathcal {E}}_I|).\)
- Step 2 The complexity of sorting a list of size \(|\alpha |\) is \(O(|\alpha |\log |\alpha |).\)
- Steps 3 and 4 To construct the spreading tree for a given node, \(|\alpha |\) infected nodes need to be attached in Steps 3 and 4. Each attachment requires the construction of a modified breadth-first tree, which has complexity \(O(|{\mathcal {E}}_I|).\) So the overall computational complexity of Steps 3 and 4 is \(O(|\alpha ||{\mathcal {E}}_I|).\)
- Step 5 The breadth-first search algorithm is needed to complete the spreading tree, which has complexity \(O(|{\mathcal {E}}_I|).\)
From the discussion above, we can conclude that the computational complexity of constructing the spreading tree from a given node and calculating the associated cost is \(O(|\alpha ||{\mathcal {E}}_I|).\) CR (or TR) repeats EIF for each infected node, with complexity \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|),\) and then sorts the infected nodes, with complexity \(O(|{\mathcal {I}}|\log |{\mathcal {I}}|).\) Therefore, the overall complexity of CR (or TR) is \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|).\)
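The building block used throughout these steps is a breadth-first search over the subgraph formed by the infected nodes, which runs in \(O(|{\mathcal {E}}_I|)\) time since each edge is examined at most twice. A minimal sketch (the adjacency list and node labels are illustrative, not the paper's implementation):

```python
from collections import deque

def bfs_tree(adj, root):
    # breadth-first search: every edge of the infected subgraph is
    # examined at most twice, so the running time is O(|E_I|)
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent  # BFS tree encoded as a child -> parent map

# toy infected subgraph (illustrative)
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
tree = bfs_tree(adj, 0)
```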
Appendix 4: Additional experimental evaluation
In this section, we present additional experiments, including a comparison to Lappas' algorithm under the IC model, an evaluation of the algorithms' scalability, and an evaluation using the normalized rank.
1.1 Comparison to Lappas’ algorithm (Lappas et al. 2010)
In this section, we evaluate the performance of the algorithm in Lappas et al. (2010) (Lappas' algorithm). Lappas' algorithm was developed for the IC model and requires the infection probabilities of the IC model. Therefore, we compared it only under the IC model; the results are shown in Fig. 10. The experimental settings are the same as those in Sect. 5.2. We assume 50% of the timestamps are observed for the TR, CR and GAU algorithms. As shown in Fig. 10, the \(\gamma \)%-accuracy of Lappas' algorithm on the IAS network is significantly smaller than that of the TR and CR algorithms when \(\gamma \ge 10.\) In the PG network, the TR and CR algorithms dominate Lappas' algorithm for all \(\gamma .\)
1.2 Scalability
We measured the execution time of the algorithms, as shown in Fig. 11. The experiments were conducted on an Intel Core i5-3210M CPU with four cores and 8 GB of RAM, running 64-bit Windows 7 Professional. All algorithms were implemented in Python 2.7. All other settings are the same as those in Sect. 5.2 with \(\mu = 100.\) As shown in Fig. 11, CR and TR are more than six times faster than GAU when 50% of the timestamps are observed. Although some algorithms that do not use timestamps are faster, their performance is worse than that of TR, CR and GAU. Lappas' algorithm is significantly slower than all the other algorithms because it operates on the full network, while the other algorithms use only the subgraph formed by the infected nodes and their neighbors. In addition, as shown in Fig. 11b, the mean and the standard deviation of the running time of TR and CR are much smaller than those of GAU when more than 10% of the timestamps are available. Furthermore, the running time of TR and CR remains roughly constant as the number of timestamps increases, while the running time of GAU increases significantly at first and then decreases slightly. The decrease occurs because, when more timestamps are observed, only the infected nodes with unobserved timestamps and the node with the earliest observed timestamp can be the source, which reduces the number of candidates and hence the total running time.
1.3 Normalized rank
In addition to the \(\gamma \%\)-accuracy, we further evaluated the performance of the algorithms using the normalized rank, defined as the ratio of the rank of the actual source to the total number of infected nodes. The observations are similar to those for the \(\gamma \%\)-accuracy, except that CR performs better than TR in the IAS network in most cases, while TR performs better in the PG network. The differences between GAU and the TR and CR algorithms are smaller. The results show that TR and CR not only achieve much better “accuracy-at-the-top”, but also improve the normalized rank in most cases. We next present a short summary of each set of simulations.
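The normalized rank is straightforward to compute from a ranked list of infected nodes; a minimal sketch, where the node labels and the ranking itself are hypothetical:

```python
def normalized_rank(ranking, source):
    # ratio of the source's 1-based position in the ranking to the
    # total number of infected nodes (smaller is better)
    return (ranking.index(source) + 1) / len(ranking)

# hypothetical ranking of 5 infected nodes, most likely source first
ranking = ["v3", "v1", "v4", "v0", "v2"]
assert normalized_rank(ranking, "v3") == 0.2  # source ranked first
assert normalized_rank(ranking, "v2") == 1.0  # source ranked last
```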
1.3.1 The impact of timestamp distribution
Tables 6, 7, 8, 9, 10 and 11 show the normalized rank under the truncated Gaussian model for the IAS network and the PG network. The settings of the experiments are the same as those in Sect. 5.3. In the IAS network, the CR algorithm yields the smallest normalized ranks and standard deviations when more than 10% of the timestamps are observed. In the PG network, TR yields the smallest normalized ranks and standard deviations.
1.3.2 The impact of the diffusion model
Tables 12, 13, 14 and 15 show the normalized rank under the IC model and the SpikeM model. The settings are the same as those in Sect. 5.4. GAU performs comparably to or better than TR and CR when the fraction of observed timestamps is small, but yields a larger normalized rank as the number of observed timestamps increases.
1.3.3 The impact of network topology
Table 16 shows the normalized rank when we remove edges from the IAS network. The settings are the same as those in Sect. 5.5; CR dominates in this case.
1.3.4 Weibo data evaluation
Table 17 shows the normalized rank for the Weibo data. The settings are the same as those in Sect. 5.6. We observed that the CR algorithm with 30% of the timestamps has the minimum normalized rank for all tweet cascade sizes.
Zhu, K., Chen, Z. & Ying, L. Locating the contagion source in networks with partial timestamps. Data Min Knowl Disc 30, 1217–1248 (2016). https://doi.org/10.1007/s10618-015-0435-9