
Locating the contagion source in networks with partial timestamps

Published in Data Mining and Knowledge Discovery.

Abstract

This paper studies the problem of identifying a single contagion source when partial timestamps of a contagion process are available. We formulate the source localization problem as a ranking problem on graphs, where infected nodes are ranked according to their likelihood of being the source. Two ranking algorithms, cost-based ranking and tree-based ranking, are proposed in this paper. Experimental evaluations with synthetic and real-world data show that our algorithms significantly improve the ranking accuracy compared with four existing algorithms.


Notes

  1. http://www.weibo.com/.

  2. Available at http://snap.stanford.edu/data/index.html.

  3. Available at http://www-personal.umich.edu/~mejn/netdata/.

  4. http://www.weibo.com/.

  5. http://www.wise2012.cs.ucy.ac.cy/challenge.html.

References

  • Agaskar A, Lu YM (2013) A fast Monte Carlo algorithm for source localization on graphs. In: SPIE optical engineering and applications

  • Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York

  • Boyd S, Cortes C, Mohri M, Radovanovic A (2012) Accuracy at the top. In: Advances in neural information processing systems, pp 962–970

  • Breiger RL, Pattison PE (1986) Cumulated social roles: the duality of persons and their algebras. Soc Netw 8(3):215–256

  • Char J (1968) Generation of trees, two-trees, and storage of master forests. IEEE Trans Circuit Theory 15(3):228–238

  • Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 199–208

  • Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1029–1038

  • Chen Z, Zhu K, Ying L (2014) Detecting multiple information sources in networks under the SIR model. In: Proceedings of the IEEE conference on information sciences and systems (CISS), Princeton

  • Dong W, Zhang W, Tan CW (2013) Rooting out the rumor culprit from suspects. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul, pp 2671–2675

  • Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. Macmillan Higher Education, New York

  • Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, DC, pp 211–220

  • Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the international conference on World Wide Web (WWW), New York, pp 491–501

  • Gundecha P, Feng Z, Liu H (2013) Seeking provenance of information using social media. In: Proceedings of the ACM international conference on information knowledge management (CIKM), San Francisco, pp 1691–1696

  • Karamchandani N, Franceschetti M (2013) Rumor source detection under probabilistic sampling. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul

  • Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Washington DC, pp 137–146

  • Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1059–1068

  • Lokhov AY, Mezard M, Ohta H, Zdeborova L (2013) Inferring the origin of an epidemic with dynamic message-passing algorithm. arXiv:1303.5315, preprint

  • Luo W, Tay WP (2013) Finding an infection source under the SIS model. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Vancouver

  • Luo W, Tay WP, Leng M (2013) Identifying infection sources and regions in large networks. IEEE Trans Signal Process 61:2850–2865

  • Luo W, Tay WP, Leng M (2014) How to identify an infection source with limited observations. IEEE J Sel Top Signal Process 8(4):586–597

  • Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 6–14

  • Myers SA, Zhu C, Leskovec J (2012) Information diffusion and external influence in networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 33–41

  • Nguyen DT, Nguyen NP, Thai MT (2012) Sources of misinformation in online social networks: who to suspect? In: Proceedings of the IEEE military communications conference (MILCOM), pp 1–6

  • Pinto PC, Thiran P, Vetterli M (2012) Locating the source of diffusion in large-scale networks. Phys Rev Lett 109(6):068702

  • Prakash BA, Vreeken J, Faloutsos C (2012) Spotting culprits in epidemics: how many and which ones? In: IEEE international conference on Data Mining (ICDM), Brussels, pp 11–20

  • Sadikov E, Medina M, Leskovec J, Garcia-Molina H (2011) Correcting for missing data in information cascades. In: Proceedings of the fourth ACM international conference on web search and data mining, pp 55–64

  • Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57:5163–5181

  • Shah D, Zaman T (2012) Rumor centrality: a universal source detector. ACM SIGMETRICS Perform Eval Rev 40(1):199–210

  • Snow J (1854) The cholera near Golden-square, and at Deptford. Med Times Gaz 9:321–322

  • Wang Z, Dong W, Zhang W, Tan CW (2014) Rumor source detection with multiple observations: fundamental limits and algorithms. In: Proceedings of the annual ACM SIGMETRICS conference, Austin

  • Zejnilovic S, Gomes J, Sinopoli B (2013) Network observability and localization of the source of diffusion based on a subset of nodes. In: Proceedings of the annual Allerton conference on communication, control and computing, Monticello

  • Zhu K, Ying L (2013) Information source detection in the SIR model: a sample path based approach. In: Proceedings of information theory and applications workshop (ITA)

  • Zhu K, Ying L (2014) A robust information source estimator with sparse observations. In: Proceedings of the IEEE international conference on computer communications (INFOCOM), Toronto

Acknowledgments

This work was supported in part by the U.S. Army Research Laboratory’s Army Research Office (ARO Grant No. W911NF1310279).

Correspondence to Kai Zhu.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, and Concha Bielza.

Appendices

Appendix 1: Proof of Lemma 1

Define \(x_{k,k-1}=t_k-t_{k-1},\) so the cost C can be written as

$$\begin{aligned} C(\mathbf{x})=\sum _{k=2}^n(t_k-t_{k-1}-\mu )^2 =\sum _{k=2}^n(x_{k,k-1}-\mu )^2. \end{aligned}$$

The cost minimization problem can be written as

$$\begin{aligned}&\min C(\mathbf{x})=\sum _{k=2}^n(x_{k,k-1}-\mu )^2 \end{aligned}$$
(7)
$$\begin{aligned} \hbox {subject to:}&\sum \nolimits _{k=2}^n x_{k,k-1}=\tau _n-\tau _1 \end{aligned}$$
(8)
$$\begin{aligned}&x_{k,k-1}\ge 0. \end{aligned}$$
(9)

Note that \(C(\mathbf{x})\) is a convex function in \(\mathbf{x}.\) By verifying the KKT condition (Boyd and Vandenberghe 2004), it can be shown that the optimal solution to the problem above is \(x_{k,k-1}=\frac{\tau _n-\tau _1}{n-1},\) which implies \(t_k=\tau _1+(k-1)\frac{\tau _n-\tau _1}{n-1}.\)
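As a numerical sanity check on Lemma 1, the sketch below (the helper names are ours, not part of the paper) compares the equal-spacing assignment against random feasible gap vectors and confirms that none achieves a lower cost:

```python
# Sanity check of Lemma 1: over all nonnegative gaps x_{k,k-1} summing to
# tau_n - tau_1, the cost C(x) = sum_k (x_{k,k-1} - mu)^2 is minimized by
# equal spacing. Helper names are ours (illustrative only).
import random

def cost(gaps, mu):
    """C(x) = sum over the n-1 gaps of (x_{k,k-1} - mu)^2."""
    return sum((x - mu) ** 2 for x in gaps)

def equal_spacing(tau1, taun, n):
    """Lemma 1's optimal assignment: t_k = tau_1 + (k-1)(tau_n - tau_1)/(n-1)."""
    step = (taun - tau1) / (n - 1)
    return [tau1 + k * step for k in range(n)]

def lemma1_holds(tau1, taun, n, mu, trials=2000, seed=0):
    """Compare the equal-spacing cost against random feasible gap vectors."""
    rng = random.Random(seed)
    total = taun - tau1
    best = cost([total / (n - 1)] * (n - 1), mu)
    for _ in range(trials):
        w = [rng.random() for _ in range(n - 1)]
        s = sum(w)
        gaps = [total * wi / s for wi in w]  # nonnegative, sums to total
        if cost(gaps, mu) < best - 1e-12:
            return False
    return True
```

For example, `equal_spacing(0.0, 10.0, 6)` gives the evenly spaced timestamps `[0.0, 2.0, 4.0, 6.0, 8.0, 10.0]`, and `lemma1_holds(0.0, 10.0, 6, 1.0)` returns `True`, as convexity of the cost predicts.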

Appendix 2: Proof of Theorem 1

Assume all nodes in the network are infected and the infection times of two nodes (say Nodes v and w) are observed. Without loss of generality, assume \(\tau _v<\tau _w.\) Furthermore, assume the graph is undirected (i.e., all edges are bidirectional) and

$$\begin{aligned} |\tau _v-\tau _w|\ge \mu (|{\mathcal {I}}|-1). \end{aligned}$$

We will prove the theorem by showing that computing the cost of Node v is related to the longest path problem between Nodes v and w.

To compute C(v), we consider those spreading trees rooted at Node v. Given a spreading tree \({\mathcal {P}}=\{{\mathcal {T}}, \mathbf{t}\}\) rooted at Node v, denote by \({\mathcal {Q}}(v,w)\) the set of edges on the path from Node v to Node w. The cost of the spreading tree can be written as

$$\begin{aligned} C({\mathcal {P}})&={\sum _{(h,u)\in {\mathcal {E}}({\mathcal {T}}) \backslash {\mathcal {Q}}(v,w)}(t_u -t_h-\mu )^2 } \end{aligned}$$
(10)
$$\begin{aligned}&\qquad + {\sum _{(h,u)\in {\mathcal {Q}}(v,w)}(t_u -t_h-\mu )^2} \end{aligned}$$
(11)

Recall that only the infection times of Nodes v and w are known. Furthermore, Nodes v and w will not both appear on a path in \({\mathcal {T}}\backslash {\mathcal {Q}}(v,w).\) Therefore, by choosing \(t_u-t_h=\mu \) for each \((h,u)\in {\mathcal {E}}({\mathcal {T}})\backslash {\mathcal {Q}}(v,w),\) we have

$$\begin{aligned} (10)=0. \end{aligned}$$

Next, applying Lemma 1, we obtain that

$$\begin{aligned} (11)\ge & {} |{\mathcal {Q}}(v,w)|\left( \frac{\tau _w-\tau _v}{|{\mathcal {Q}}(v,w)|} -\mu \right) ^2, \end{aligned}$$
(12)

where the equality is achieved by assigning the timestamps according to Lemma 1.

For fixed \(|\tau _w-\tau _v|\) and \(\mu ,\) we have

$$\begin{aligned} \frac{\partial (12)}{\partial |{\mathcal {Q}}(v,w)|}&=\mu ^2-\left( \frac{\tau _w-\tau _v}{|{\mathcal {Q}}(v,w)|} \right) ^2\\&<_{(a)}\mu ^2-\left( \frac{\mu (|{\mathcal {I}}|-1)}{|{\mathcal {Q}}(v,w)|}\right) ^2\\&<_{(b)} \mu ^2-\left( \frac{\mu (|{\mathcal {I}}|-1)}{(|{\mathcal {I}}|-1)}\right) ^2=0, \end{aligned}$$

where inequality (a) holds because of the assumption \(\tau _w-\tau _v>\mu (|{\mathcal {I}}|-1)\) and inequality (b) is due to \(|{\mathcal {Q}}(v,w)|\le |{\mathcal {I}}|-1.\) So (12) is a decreasing function of \(|{\mathcal {Q}}(v,w)|\) (the length of the path).

Let \(\eta \) denote the length of the longest path between v and w. Given the longest path between v and w,  we can construct a spreading tree \({\mathcal {P}}^*\) by generating \({\mathcal {T}}^*\) using the breadth-first search starting from the longest path and assigning timestamps \(\mathbf{t}^*\) as mentioned above. Then,

$$\begin{aligned} C(v)=C({\mathcal {P}}^*)=\min _{{\mathcal {P}}_v\in {\mathcal {L}}({\mathcal {I}}, {\varvec{\tau }})}C({\mathcal {P}}_v)=\eta \left( \frac{\tau _w-\tau _v}{\eta }-\mu \right) ^2. \end{aligned}$$
(13)

Therefore, the algorithm that computes C(v) can be used to find the longest path between Nodes v and w. Since the longest path problem is NP-hard (Garey and Johnson 1979), the calculation of C(v) must also be NP-hard.
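The monotonicity step in the proof can be illustrated numerically. The sketch below (our naming, not the paper's) evaluates the path cost from Eq. (13) and checks that it strictly decreases in the path length when \(\tau _w-\tau _v\ge \mu (|{\mathcal {I}}|-1),\) which is why minimizing C(v) forces the longest v–w path:

```python
def path_cost(eta, delta_tau, mu):
    """Eq. (13): cost of a v-to-w path of length eta,
    eta * (delta_tau / eta - mu)^2, where delta_tau = tau_w - tau_v."""
    return eta * (delta_tau / eta - mu) ** 2

def strictly_decreasing_in_length(delta_tau, mu, n_infected):
    """True if the path cost strictly decreases for eta = 1, ..., |I| - 1."""
    costs = [path_cost(eta, delta_tau, mu) for eta in range(1, n_infected)]
    return all(a > b for a, b in zip(costs, costs[1:]))
```

With \(\mu =1\) and \(|{\mathcal {I}}|=11,\) the cost is strictly decreasing for \(\tau _w-\tau _v=10=\mu (|{\mathcal {I}}|-1),\) but not for a small gap such as \(\tau _w-\tau _v=2,\) matching the role of the theorem's assumption.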

Appendix 3: Proof of Theorem 2

Note that the complexity of the modified breadth-first search is \(O(|{\mathcal {E}}_I|)\) since each edge in the subgraph formed by the infected nodes needs to be considered only once. We next analyze the complexity of EIF:

  • Step 1 The complexity of computing the paths from an infected node to all other infected nodes is \(O(|{\mathcal {E}}_I|).\) Given \(|\alpha |\) infected nodes with timestamps, the computational complexity of Step 1 is \(O(|\alpha ||{\mathcal {E}}_I|).\)

  • Step 2 The complexity of sorting a list of size \(|\alpha |\) is \(O(|\alpha |\log (|\alpha |)).\)

  • Steps 3 and 4 To construct the spreading tree for a given node, \(|\alpha |\) infected nodes need to be attached in Steps 3 and 4. Each attachment requires the construction of a modified breadth-first tree, which has complexity \(O(|{\mathcal {E}}_I|).\) So the overall computational complexity of Steps 3 and 4 is \(O(|\alpha ||{\mathcal {E}}_I|).\)

  • Step 5 The breadth-first search algorithm is needed to complete the spreading tree, which has complexity \(O(|{\mathcal {E}}_I|).\)

From the discussion above, we conclude that constructing the spreading tree from a given node and calculating the associated cost has computational complexity \(O(|\alpha ||{\mathcal {E}}_I|).\) CR (or TR) repeats EIF for each infected node, with complexity \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|),\) and then sorts the infected nodes, with complexity \(O(|{\mathcal {I}}|\log |{\mathcal {I}}|).\) Therefore, the overall complexity of CR (or TR) is \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|).\)
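The \(O(|{\mathcal {E}}_I|)\) bound holds because a breadth-first search examines each edge a constant number of times. The sketch below illustrates this with a plain BFS that builds a spreading tree, as in Step 5 (a minimal sketch with our own naming; the modified BFS in EIF adds bookkeeping but has the same bound):

```python
from collections import deque

def bfs_spreading_tree(adj, root):
    """Build a breadth-first spreading tree rooted at `root`.
    `adj` maps each infected node to its infected neighbors. Every edge is
    examined O(1) times, so the running time is O(|E_I|)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in parent:  # first visit fixes v's parent in the tree
                parent[v] = u
                queue.append(v)
    return parent
```

For instance, on the path-plus-leaf graph `{0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}` rooted at node 0, the tree assigns node 3 the parent 1.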

Appendix 4: Additional experimental evaluation

In this section, we present the additional experiments we conducted: a comparison to Lappas' algorithm under the IC model, an evaluation of the algorithms' scalability, and an evaluation using the normalized rank.

1.1 Comparison to Lappas’ algorithm (Lappas et al. 2010)

In this section, we evaluate the performance of the algorithm in Lappas et al. (2010) (Lappas' algorithm). Lappas' algorithm was developed for the IC model and requires the infection probabilities of the IC model. Therefore, we compared it only under the IC model; the results are shown in Fig. 10. The experimental settings are the same as those in Sect. 5.2. We assume 50% of the timestamps are observed for the TR, CR and GAU algorithms. As shown in Fig. 10, the \(\gamma \)%-accuracy of Lappas' algorithm on the IAS network is significantly smaller than that of the TR and CR algorithms when \(\gamma \ge 10.\) In the PG network, the TR and CR algorithms dominate Lappas' algorithm for all \(\gamma .\)
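For reference, the \(\gamma \)%-accuracy can be computed as follows (a sketch reflecting our reading of the metric from Sect. 5: the fraction of trials in which the true source appears within the top \(\gamma \)% of the ranked infected nodes; the function name and trial format are ours):

```python
import math

def gamma_accuracy(trials, gamma):
    """Fraction of trials whose true source appears in the top gamma% of
    the ranking. Each trial is a (ranking, source) pair, with `ranking`
    ordered from most to least likely source."""
    hits = 0
    for ranking, source in trials:
        # Number of nodes in the top gamma% (at least one).
        top = max(1, math.ceil(len(ranking) * gamma / 100.0))
        hits += source in ranking[:top]
    return hits / len(trials)
```

For example, with ten ranked nodes per trial, a source ranked first is a hit at \(\gamma =10,\) while a source ranked sixth is a hit only once \(\gamma \ge 60.\)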

Fig. 10 The performance comparison to Lappas' algorithm. a \(\gamma \)%-accuracy in the IAS network. b \(\gamma \)%-accuracy in the PG network

Fig. 11 Execution time in the IAS network under the IC model. a Normalized rank versus computation time (50% of timestamps observed). b Timestamp size versus computation time

1.2 Scalability

We measured the execution time of the algorithms, as shown in Fig. 11. The experiments were conducted on an Intel Core i5-3210M CPU with four cores and 8 GB of RAM, running 64-bit Windows 7 Professional. All algorithms were implemented in Python 2.7. All other settings are the same as those in Sect. 5.2, with \(\mu = 100.\) As shown in Fig. 11, CR and TR are more than six times faster than GAU when 50% of the timestamps are observed. Although some algorithms that do not use timestamps are faster, they perform worse than TR, CR and GAU. Lappas' algorithm is significantly slower than all the other algorithms because it operates on the full network, whereas the other algorithms operate only on the subgraph formed by the infected nodes and their neighbors. In addition, as shown in Fig. 11b, the mean and the standard deviation of the running times of TR and CR are much smaller than those of GAU when more than 10% of the timestamps are available. Furthermore, the running times of TR and CR remain roughly constant as the number of timestamps increases, while the running time of GAU first increases significantly and then decreases slightly. The decrease occurs because, when more timestamps are observed, only the infected nodes with unobserved timestamps and the node with the earliest observed timestamp can be the source, which reduces the number of candidates and hence the total running time.

1.3 Normalized rank

In addition to the \(\gamma \%\)-accuracy, we further evaluated the performance of the algorithms using the normalized rank, defined as the ratio of the rank of the actual source to the total number of infected nodes. The observations are similar to those for the \(\gamma \%\)-accuracy, except that CR performs better than TR in the IAS network in most cases, while TR performs better in the PG network. The differences between GAU and the TR and CR algorithms are also smaller. The results show that TR and CR not only achieve much better “accuracy-at-the-top”, but also improve the normalized rank in most cases. We next present a short summary of each set of simulations.
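The normalized rank itself is a one-line computation (a sketch with a hypothetical helper name):

```python
def normalized_rank(ranking, source):
    """Rank of the true source (1-based) divided by the number of infected
    nodes; lower is better, and 1/len(ranking) is the ideal value."""
    return (ranking.index(source) + 1) / len(ranking)
```

For example, a ranking of five infected nodes with the actual source in second place gives a normalized rank of 0.4.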

1.3.1 The impact of timestamp distribution

Tables 6, 7, 8, 9, 10 and 11 show the normalized rank for the truncated Gaussian model on the IAS network and the PG network. The settings of the experiments are the same as those in Sect. 5.3. In the IAS network, the CR algorithm yields the smallest normalized ranks and standard deviations when more than 10% of the timestamps are observed. In the PG network, TR yields the smallest normalized ranks and standard deviations.

Table 6 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 1\)
Table 7 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 10\)
Table 8 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 100\)
Table 9 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 1\)
Table 10 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 10\)
Table 11 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 100\)
Table 12 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network under the IC model
Table 13 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network under the SpikeM model
Table 14 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network under the IC model
Table 15 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network under the SpikeM model
Table 16 Normalized rank (mean \(\pm \) standard deviation) as the number of removed edges increases in the IAS network
Table 17 Normalized rank for different tweet cascade sizes (mean \(\pm \) standard deviation) on the Weibo dataset

1.3.2 The impact of the diffusion model

Tables 12, 13, 14 and 15 show the normalized rank under the IC model and the SpikeM model. The settings are the same as those in Sect. 5.4. GAU performs as well as or better than TR and CR when the fraction of observed timestamps is small, but yields a larger normalized rank as the number of observed timestamps increases.

1.3.3 The impact of network topology

Table 16 shows the normalized rank as edges are removed from the IAS network. The settings are the same as those in Sect. 5.5; CR dominates in this case.

1.3.4 Weibo data evaluation

Table 17 shows the normalized rank for the Weibo data. The settings are the same as those in Sect. 5.6. We observed that the CR algorithm with 30% of the timestamps has the minimum normalized rank for all tweet cascade sizes.

About this article

Cite this article

Zhu, K., Chen, Z. & Ying, L. Locating the contagion source in networks with partial timestamps. Data Min Knowl Disc 30, 1217–1248 (2016). https://doi.org/10.1007/s10618-015-0435-9
