
Locating the contagion source in networks with partial timestamps

Published in Data Mining and Knowledge Discovery.

Abstract

This paper studies the problem of identifying a single contagion source when partial timestamps of a contagion process are available. We formulate the source localization problem as a ranking problem on graphs, where infected nodes are ranked according to their likelihood of being the source. Two ranking algorithms, cost-based ranking and tree-based ranking, are proposed in this paper. Experimental evaluations with synthetic and real-world data show that our algorithms significantly improve the ranking accuracy compared with four existing algorithms.


Notes

  1. http://www.weibo.com/.

  2. Available at http://snap.stanford.edu/data/index.html.

  3. Available at http://www-personal.umich.edu/~mejn/netdata/.

  4. http://www.weibo.com/.

  5. http://www.wise2012.cs.ucy.ac.cy/challenge.html.

References

  • Agaskar A, Lu YM (2013) A fast Monte Carlo algorithm for source localization on graphs. In: SPIE optical engineering and applications

  • Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, New York

  • Boyd S, Cortes C, Mohri M, Radovanovic A (2012) Accuracy at the top. In: Advances in neural information processing systems, pp 962–970

  • Breiger RL, Pattison PE (1986) Cumulated social roles: the duality of persons and their algebras. Soc Netw 8(3):215–256

  • Char J (1968) Generation of trees, two-trees, and storage of master forests. IEEE Trans Circuit Theory 15(3):228–238

  • Chen W, Wang Y, Yang S (2009) Efficient influence maximization in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 199–208

  • Chen W, Wang C, Wang Y (2010) Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1029–1038

  • Chen Z, Zhu K, Ying L (2014) Detecting multiple information sources in networks under the SIR model. In: Proceedings of the IEEE conference on information sciences and systems (CISS), Princeton

  • Dong W, Zhang W, Tan CW (2013) Rooting out the rumor culprit from suspects. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul, pp 2671–2675

  • Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. Macmillan Higher Education, New York

  • Goyal A, Lu W, Lakshmanan LVS (2011) Simpath: an efficient algorithm for influence maximization under the linear threshold model. In: IEEE international conference on data mining (ICDM). IEEE Computer Society, Washington, DC, pp 211–220

  • Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the international conference on World Wide Web (WWW), New York, pp 491–501

  • Gundecha P, Feng Z, Liu H (2013) Seeking provenance of information using social media. In: Proceedings of the ACM international conference on information knowledge management (CIKM), San Francisco, pp 1691–1696

  • Karamchandani N, Franceschetti M (2013) Rumor source detection under probabilistic sampling. In: Proceedings of the IEEE international symposium on information theory (ISIT), Istanbul

  • Kempe D, Kleinberg J, Tardos E (2003) Maximizing the spread of influence through a social network. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Washington DC, pp 137–146

  • Lappas T, Terzi E, Gunopulos D, Mannila H (2010) Finding effectors in social networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), pp 1059–1068

  • Lokhov AY, Mezard M, Ohta H, Zdeborova L (2013) Inferring the origin of an epidemic with dynamic message-passing algorithm. arXiv:1303.5315, preprint

  • Luo W, Tay WP (2013) Finding an infection source under the SIS model. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), Vancouver

  • Luo W, Tay WP, Leng M (2013) Identifying infection sources and regions in large networks. IEEE Trans Signal Process 61:2850–2865

  • Luo W, Tay WP, Leng M (2014) How to identify an infection source with limited observations. IEEE J Sel Top Signal Process 8(4):586–597

  • Matsubara Y, Sakurai Y, Prakash BA, Li L, Faloutsos C (2012) Rise and fall patterns of information diffusion: model and implications. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 6–14

  • Myers SA, Zhu C, Leskovec J (2012) Information diffusion and external influence in networks. In: Proceedings of the annual ACM SIGKDD conference on knowledge discovery and data mining (KDD), Beijing, pp 33–41

  • Nguyen DT, Nguyen NP, Thai MT (2012) Sources of misinformation in online social networks: who to suspect? In: Proceedings of the IEEE military communications conference (MILCOM), pp 1–6

  • Pinto PC, Thiran P, Vetterli M (2012) Locating the source of diffusion in large-scale networks. Phys Rev Lett 109(6):068702

  • Prakash BA, Vreeken J, Faloutsos C (2012) Spotting culprits in epidemics: how many and which ones? In: IEEE international conference on Data Mining (ICDM), Brussels, pp 11–20

  • Sadikov E, Medina M, Leskovec J, Garcia-Molina H (2011) Correcting for missing data in information cascades. In: Proceedings of the fourth ACM international conference on web search and data mining, pp 55–64

  • Shah D, Zaman T (2011) Rumors in a network: who’s the culprit? IEEE Trans Inf Theory 57:5163–5181

  • Shah D, Zaman T (2012) Rumor centrality: a universal source detector. ACM SIGMETRICS Perform Eval Rev 40(1):199–210

  • Snow J (1854) The cholera near Golden-square, and at Deptford. Med Times Gaz 9:321–322

  • Wang Z, Dong W, Zhang W, Tan CW (2014) Rumor source detection with multiple observations: fundamental limits and algorithms. In: Proceedings of the annual ACM SIGMETRICS conference, Austin

  • Zejnilovic S, Gomes J, Sinopoli B (2013) Network observability and localization of the source of diffusion based on a subset of nodes. In: Proceedings of the annual Allerton conference on communication, control and computing, Monticello

  • Zhu K, Ying L (2013) Information source detection in the SIR model: a sample path based approach. In: Proceedings of information theory and applications workshop (ITA)

  • Zhu K, Ying L (2014) A robust information source estimator with sparse observations. In: Proceedings of the IEEE international conference on computer communications (INFOCOM), Toronto

Acknowledgments

This work was supported in part by the U.S. Army Research Laboratory’s Army Research Office (ARO Grant No. W911NF1310279).

Correspondence to Kai Zhu.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, and Concha Bielza.

Appendices

Appendix 1: Proof of Lemma 1

Define \(x_{k,k-1}=t_k-t_{k-1},\) so the cost C can be written as

$$\begin{aligned} C(\mathbf{x})=\sum _{k=2}^n(t_k-t_{k-1}-\mu )^2 =\sum _{k=2}^n(x_{k,k-1}-\mu )^2. \end{aligned}$$

The cost minimization problem can be written as

$$\begin{aligned}&\min C(\mathbf{x})=\sum _{k=2}^n(x_{k,k-1}-\mu )^2 \end{aligned}$$
(7)
$$\begin{aligned} \hbox {subject to:}&\sum \nolimits _{k=2}^n x_{k,k-1}=\tau _n-\tau _1 \end{aligned}$$
(8)
$$\begin{aligned}&x_{k,k-1}\ge 0. \end{aligned}$$
(9)

Note that \(C(\mathbf{x})\) is a convex function in \(\mathbf{x}.\) By verifying the KKT condition (Boyd and Vandenberghe 2004), it can be shown that the optimal solution to the problem above is \(x_{k,k-1}=\frac{\tau _n-\tau _1}{n-1},\) which implies \(t_k=\tau _1+(k-1)\frac{\tau _n-\tau _1}{n-1}.\)
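As a numerical sanity check on Lemma 1, the sketch below (the helper names are ours, not part of the paper) compares the equal-spacing assignment against random feasible gap vectors and confirms that none achieves a lower cost:

```python
# Sanity check of Lemma 1: over all nonnegative gaps x_{k,k-1} summing to
# tau_n - tau_1, the cost C(x) = sum_k (x_{k,k-1} - mu)^2 is minimized by
# equal spacing. Helper names are ours (illustrative only).
import random

def cost(gaps, mu):
    """C(x) = sum over the n-1 gaps of (x_{k,k-1} - mu)^2."""
    return sum((x - mu) ** 2 for x in gaps)

def equal_spacing(tau1, taun, n):
    """Lemma 1's optimal assignment: t_k = tau_1 + (k-1)(tau_n - tau_1)/(n-1)."""
    step = (taun - tau1) / (n - 1)
    return [tau1 + k * step for k in range(n)]

def lemma1_holds(tau1, taun, n, mu, trials=2000, seed=0):
    """Compare the equal-spacing cost against random feasible gap vectors."""
    rng = random.Random(seed)
    total = taun - tau1
    best = cost([total / (n - 1)] * (n - 1), mu)
    for _ in range(trials):
        w = [rng.random() for _ in range(n - 1)]
        s = sum(w)
        gaps = [total * wi / s for wi in w]  # nonnegative, sums to total
        if cost(gaps, mu) < best - 1e-12:
            return False
    return True
```

For example, `equal_spacing(0.0, 10.0, 6)` gives the evenly spaced timestamps `[0.0, 2.0, 4.0, 6.0, 8.0, 10.0]`, and `lemma1_holds(0.0, 10.0, 6, 1.0)` returns `True`, as convexity of the cost predicts.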

Appendix 2: Proof of Theorem 1

Assume all nodes in the network are infected and the infection times of two nodes (say Nodes v and w) are observed. Without loss of generality, assume \(\tau _v<\tau _w.\) Furthermore, assume the graph is undirected (i.e., all edges are bidirectional) and

$$\begin{aligned} |\tau _v-\tau _w|\ge \mu (|{\mathcal {I}}|-1). \end{aligned}$$

We will prove the theorem by showing that computing the cost of Node v is related to the longest path problem between Nodes v and w.

To compute C(v), we consider those spreading trees rooted at Node v. Given a spreading tree \({\mathcal {P}}=\{{\mathcal {T}}, \mathbf{t}\}\) rooted at Node v, denote by \({\mathcal {Q}}(v,w)\) the set of edges on the path from Node v to Node w. The cost of the spreading tree can be written as

$$\begin{aligned} C({\mathcal {P}})&={\sum _{(h,u)\in {\mathcal {E}}({\mathcal {T}}) \backslash {\mathcal {Q}}(v,w)}(t_u -t_h-\mu )^2 } \end{aligned}$$
(10)
$$\begin{aligned}&\qquad + {\sum _{(h,u)\in {\mathcal {Q}}(v,w)}(t_u -t_h-\mu )^2} \end{aligned}$$
(11)

Recall that only the infection times of Nodes v and w are known. Furthermore, Nodes v and w will not both appear on a path in \({\mathcal {T}}\backslash {\mathcal {Q}}(v,w).\) Therefore, by choosing \(t_u-t_h=\mu \) for each \((h,u)\in {\mathcal {E}}({\mathcal {T}})\backslash {\mathcal {Q}}(v,w),\) we have

$$\begin{aligned} (10)=0. \end{aligned}$$

Next, applying Lemma 1, we obtain that

$$\begin{aligned} (11)\ge & {} |{\mathcal {Q}}(v,w)|\left( \frac{\tau _w-\tau _v}{|{\mathcal {Q}}(v,w)|} -\mu \right) ^2, \end{aligned}$$
(12)

where the equality is achieved by assigning the timestamps according to Lemma 1.

For fixed \(|\tau _w-\tau _v|\) and \(\mu ,\) we have

$$\begin{aligned} \frac{\partial (12)}{\partial |{\mathcal {Q}}(v,w)|}&=\mu ^2-\left( \frac{\tau _w-\tau _v}{|{\mathcal {Q}}(v,w)|} \right) ^2\\&<_{(a)}\mu ^2-\left( \frac{\mu (|{\mathcal {I}}|-1)}{|{\mathcal {Q}}(v,w)|}\right) ^2\\&<_{(b)} \mu ^2-\left( \frac{\mu (|{\mathcal {I}}|-1)}{(|{\mathcal {I}}|-1)}\right) ^2=0, \end{aligned}$$

where inequality (a) holds because of the assumption \(\tau _w-\tau _v>\mu (|{\mathcal {I}}|-1)\) and inequality (b) is due to \(|{\mathcal {Q}}(v,w)|\le |{\mathcal {I}}|-1.\) So (12) is a decreasing function of \(|{\mathcal {Q}}(v,w)|\) (the length of the path).

Let \(\eta \) denote the length of the longest path between v and w. Given the longest path between v and w,  we can construct a spreading tree \({\mathcal {P}}^*\) by generating \({\mathcal {T}}^*\) using the breadth-first search starting from the longest path and assigning timestamps \(\mathbf{t}^*\) as mentioned above. Then,

$$\begin{aligned} C(v)=C({\mathcal {P}}^*)=\min _{{\mathcal {P}}_v\in {\mathcal {L}}({\mathcal {I}}, {\varvec{\tau }})}C({\mathcal {P}}_v)=\eta \left( \frac{\tau _w-\tau _v}{\eta }-\mu \right) ^2. \end{aligned}$$
(13)

Therefore, the algorithm that computes C(v) can be used to find the longest path between Nodes v and w. Since the longest path problem is NP-hard (Garey and Johnson 1979), the calculation of C(v) must also be NP-hard.
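The monotonicity step in the proof can be illustrated numerically. The sketch below (our naming, not the paper's) evaluates the path cost from Eq. (13) and checks that it strictly decreases in the path length when \(\tau _w-\tau _v\ge \mu (|{\mathcal {I}}|-1),\) which is why minimizing C(v) forces the longest v–w path:

```python
def path_cost(eta, delta_tau, mu):
    """Eq. (13): cost of a v-to-w path of length eta,
    eta * (delta_tau / eta - mu)^2, where delta_tau = tau_w - tau_v."""
    return eta * (delta_tau / eta - mu) ** 2

def strictly_decreasing_in_length(delta_tau, mu, n_infected):
    """True if the path cost strictly decreases for eta = 1, ..., |I| - 1."""
    costs = [path_cost(eta, delta_tau, mu) for eta in range(1, n_infected)]
    return all(a > b for a, b in zip(costs, costs[1:]))
```

With \(\mu =1\) and \(|{\mathcal {I}}|=11,\) the cost is strictly decreasing for \(\tau _w-\tau _v=10=\mu (|{\mathcal {I}}|-1),\) but not for a small gap such as \(\tau _w-\tau _v=2,\) matching the role of the theorem's assumption.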

Appendix 3: Proof of Theorem 2

Note that the complexity of the modified breadth-first search is \(O(|{\mathcal {E}}_I|)\) since each edge in the subgraph formed by the infected nodes needs to be considered only once. We next analyze the complexity of EIF:

  • Step 1 The complexity of computing the paths from an infected node to all other infected nodes is \(O(|{\mathcal {E}}_I|).\) Given \(|\alpha |\) infected nodes with timestamps, the computational complexity of Step 1 is \(O(|\alpha ||{\mathcal {E}}_I|).\)

  • Step 2 The complexity of sorting a list of size \(|\alpha |\) is \(O(|\alpha |\log (|\alpha |)).\)

  • Steps 3 and 4 To construct the spreading tree for a given node, \(|\alpha |\) infected nodes need to be attached in Steps 3 and 4. Each attachment requires the construction of a modified breadth-first tree, which has complexity \(O(|{\mathcal {E}}_I|).\) So the overall computational complexity of Steps 3 and 4 is \(O(|\alpha ||{\mathcal {E}}_I|).\)

  • Step 5 The breadth-first search algorithm is needed to complete the spreading tree, which has complexity \(O(|{\mathcal {E}}_I|).\)

From the discussion above, we conclude that constructing the spreading tree from a given node and calculating the associated cost has computational complexity \(O(|\alpha ||{\mathcal {E}}_I|).\) CR (or TR) repeats EIF for each infected node, with complexity \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|),\) and then sorts the infected nodes, with complexity \(O(|{\mathcal {I}}|\log |{\mathcal {I}}|).\) Therefore, the overall complexity of CR (or TR) is \(O(|\alpha ||{\mathcal {I}}||{\mathcal {E}}_I|).\)
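The \(O(|{\mathcal {E}}_I|)\) bound holds because a breadth-first search examines each edge a constant number of times. The sketch below illustrates this with a plain BFS that builds a spreading tree, as in Step 5 (a minimal sketch with our own naming; the modified BFS in EIF adds bookkeeping but has the same bound):

```python
from collections import deque

def bfs_spreading_tree(adj, root):
    """Build a breadth-first spreading tree rooted at `root`.
    `adj` maps each infected node to its infected neighbors. Every edge is
    examined O(1) times, so the running time is O(|E_I|)."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in parent:  # first visit fixes v's parent in the tree
                parent[v] = u
                queue.append(v)
    return parent
```

For instance, on the path-plus-leaf graph `{0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}` rooted at node 0, the tree assigns node 3 the parent 1.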

Appendix 4: Additional experimental evaluation

In this section, we present the additional experiments we conducted: a comparison to Lappas' algorithm under the IC model, an evaluation of the algorithms' scalability, and an evaluation using the normalized rank.

1.1 Comparison to Lappas’ algorithm (Lappas et al. 2010)

In this section, we evaluate the performance of the algorithm in Lappas et al. (2010) (Lappas' algorithm). Lappas' algorithm was developed for the IC model and requires the infection probabilities of the IC model. Therefore, we compared it only under the IC model; the results are shown in Fig. 10. The experimental settings are the same as those in Sect. 5.2. We assume 50% of the timestamps are observed for the TR, CR and GAU algorithms. As shown in Fig. 10, the \(\gamma \)%-accuracy of Lappas' algorithm on the IAS network is significantly smaller than that of the TR and CR algorithms when \(\gamma \ge 10.\) In the PG network, the TR and CR algorithms dominate Lappas' algorithm for all \(\gamma .\)
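For reference, the \(\gamma \)%-accuracy can be computed as follows (a sketch reflecting our reading of the metric from Sect. 5: the fraction of trials in which the true source appears within the top \(\gamma \)% of the ranked infected nodes; the function name and trial format are ours):

```python
import math

def gamma_accuracy(trials, gamma):
    """Fraction of trials whose true source appears in the top gamma% of
    the ranking. Each trial is a (ranking, source) pair, with `ranking`
    ordered from most to least likely source."""
    hits = 0
    for ranking, source in trials:
        # Number of nodes in the top gamma% (at least one).
        top = max(1, math.ceil(len(ranking) * gamma / 100.0))
        hits += source in ranking[:top]
    return hits / len(trials)
```

For example, with ten ranked nodes per trial, a source ranked first is a hit at \(\gamma =10,\) while a source ranked sixth is a hit only once \(\gamma \ge 60.\)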

Fig. 10 The performance comparison to Lappas' algorithm. a \(\gamma \)%-accuracy in the IAS network. b \(\gamma \)%-accuracy in the PG network

Fig. 11 Execution time in the IAS network under the IC model. a Normalized rank versus computation time (50% of timestamps observed). b Timestamp size versus computation time

1.2 Scalability

We measured the execution time of the algorithms, as shown in Fig. 11. The experiments were conducted on an Intel Core i5-3210M CPU with four cores and 8 GB of RAM, running 64-bit Windows 7 Professional. All algorithms were implemented in Python 2.7. All other settings are the same as those in Sect. 5.2, with \(\mu = 100.\) As shown in Fig. 11, CR and TR are more than six times faster than GAU when 50% of the timestamps are observed. Although some algorithms that do not use timestamps are faster, they perform worse than TR, CR and GAU. Lappas' algorithm is significantly slower than all the other algorithms because it operates on the full network, whereas the other algorithms operate only on the subgraph formed by the infected nodes and their neighbors. In addition, as shown in Fig. 11b, the mean and the standard deviation of the running times of TR and CR are much smaller than those of GAU when more than 10% of the timestamps are available. Furthermore, the running times of TR and CR remain roughly constant as the number of timestamps increases, while the running time of GAU first increases significantly and then decreases slightly. The decrease occurs because, when more timestamps are observed, only the infected nodes with unobserved timestamps and the node with the earliest observed timestamp can be the source, which reduces the number of candidates and hence the total running time.

1.3 Normalized rank

In addition to the \(\gamma \%\)-accuracy, we further evaluated the performance of the algorithms using the normalized rank, defined as the ratio of the rank of the actual source to the total number of infected nodes. The observations are similar to those for the \(\gamma \%\)-accuracy, except that CR performs better than TR in the IAS network in most cases, while TR performs better in the PG network. The differences between GAU and the TR and CR algorithms are also smaller. The results show that TR and CR not only achieve much better “accuracy-at-the-top”, but also improve the normalized rank in most cases. We next present a short summary of each set of simulations.
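The normalized rank itself is a one-line computation (a sketch with a hypothetical helper name):

```python
def normalized_rank(ranking, source):
    """Rank of the true source (1-based) divided by the number of infected
    nodes; lower is better, and 1/len(ranking) is the ideal value."""
    return (ranking.index(source) + 1) / len(ranking)
```

For example, a ranking of five infected nodes with the actual source in second place gives a normalized rank of 0.4.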

1.3.1 The impact of timestamp distribution

Tables 6, 7, 8, 9, 10 and 11 show the normalized rank for the truncated Gaussian model on the IAS network and the PG network. The settings of the experiments are the same as those in Sect. 5.3. In the IAS network, the CR algorithm yields the smallest normalized ranks and standard deviations when more than 10% of the timestamps are observed. In the PG network, TR yields the smallest normalized ranks and standard deviations.

Table 6 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 1\)
Table 7 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 10\)
Table 8 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network when \(\mu = 100\)
Table 9 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 1\)
Table 10 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 10\)
Table 11 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network when \(\mu = 100\)
Table 12 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network under the IC model
Table 13 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the IAS network under the SpikeM model
Table 14 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network under the IC model
Table 15 Normalized rank (mean \(\pm \) standard deviation) for different distributions and sizes of timestamps on the PG network under the SpikeM model
Table 16 Normalized rank (mean \(\pm \) standard deviation) as the number of removed edges increases in the IAS network
Table 17 Normalized rank for different tweet cascade sizes (mean \(\pm \) standard deviation) on the Weibo dataset

1.3.2 The impact of the diffusion model

Tables 12, 13, 14 and 15 show the normalized rank under the IC model and the SpikeM model. The settings are the same as those in Sect. 5.4. GAU performs as well as or better than TR and CR when the fraction of observed timestamps is small, but yields a larger normalized rank as the number of observed timestamps increases.

1.3.3 The impact of network topology

Table 16 shows the normalized rank as edges are removed from the IAS network. The settings are the same as those in Sect. 5.5; CR dominates in this case.

1.3.4 Weibo data evaluation

Table 17 shows the normalized rank for the Weibo data. The settings are the same as those in Sect. 5.6. We observed that the CR algorithm with 30% of the timestamps has the minimum normalized rank for all tweet cascade sizes.

About this article

Cite this article

Zhu, K., Chen, Z. & Ying, L. Locating the contagion source in networks with partial timestamps. Data Min Knowl Disc 30, 1217–1248 (2016). https://doi.org/10.1007/s10618-015-0435-9
