
1 Introduction

Today, hundreds of millions of people use online social networks to access, discuss, produce and share content. These social networks now have an important impact on the way information travels worldwide. This has motivated a large amount of research on the topic of information diffusion prediction: how can we predict which users will be infected by a given piece of information in the future? This “word of mouth” phenomenon has been widely studied over the last decade (see [8] for a comprehensive survey).

More recently, the problem of source detection has emerged. This is the opposite task: the goal is to retrieve which user started some diffusion episode, given the set of eventually infected users. In an epidemiological context, this is also known as the patient zero problem. For social media, the main application of this problem is to retrieve the source of some rumor, leak or disinformation, either to remove the content from the network or to take legal action against its author.

While several works have already studied this problem (see Sect. 2), they are all based on the assumption that the social graph on which diffusion takes place is either known or can be inferred, and that information diffusion follows some known propagation model such as the SI model [16, 20] or the NetRate model [5]. These turn out to be strong assumptions in most applications.

In this paper, we drop the aforementioned assumptions by using a Representation Learning approach to embed users in a latent space and use their representations to directly retrieve infection sources. This method does not require the influence graph to be known, and can be applied to partially observed diffusion episodes. Moreover, it allows us to easily take into account the topic of the diffusion at hand, by defining content-specific transformations of the representations of the users involved in it. To the best of our knowledge, our approach is the first one to consider the content for source detection tasks. Our proposals are tested on real diffusion traces extracted from online social networks, something that is often missing in the literature of the field.

The rest of the paper is organized as follows. Section 2 reviews some related works and presents the motivations of our proposal. Section 3 introduces our model. Finally, Sect. 4 compares our model to various baselines.

2 Background and Motivations

While the topic of information diffusion prediction has been studied for a long time [8, 9, 18], source detection has only become a subject of research in the past few years.

As classically done in the field of diffusion modeling, existing approaches for source detection are based on the Susceptible-Infected framework defined on a given known graph of diffusion \(G=(\mathcal {U}, E)\). When a user \(u \in \mathcal {U}\) becomes infected at time t, each neighbor v in the graph becomes infected at time \(t+d_{u,v}\), with \(d_{u,v}\) being drawn from some delay distribution [4, 5, 16, 19, 20, 22]. The various methods mainly differ in their way of reversing the process of diffusion to predict the most probable source when some infections are observed.

The work of [20] was the first one to introduce the key concept of rumor centrality, a measure capturing the likelihood that a content emitted from a node \(u \in \mathcal {U}\) spreads over a given subset of infected users \(\mathcal {U}' \subseteq \mathcal {U}\), knowing some diffusion relationships \(E' \subseteq E\) between them. When a set \(\mathcal {U}'\) of infected users is observed at some time T, the source user can be estimated with a maximum likelihood approach:

$$\begin{aligned} s^* = \text {arg}\max _{s\,\in \,\mathcal {U}'} P(\mathcal {U}' | s) \propto \text {arg}\max _{s\,\in \,\mathcal {U}'} R(s,\mathcal {U}') \end{aligned}$$

where \(R(s,\mathcal {U}')\) stands for the rumor centrality measure applied to the source candidate \(s \in \mathcal {U}'\), which is computed by counting the number of possible sequences of infections of nodes from \(\mathcal {U}'\) that start with s and are consistent with the precedence graph defined by \(G'=(\mathcal {U}',E')\). This work was later extended in [4, 22] to optimize the estimation of R on more complex graph structures. All these works assume that one observes a complete snapshot of the network (infected nodes and edges) at some time T, and that the source is among these infected nodes. The infection time of each node is left unknown. Later, [19] proposed a framework in which only the states of a subset of all users (called “monitors”) are observed, and compared various heuristics to select monitors and to retrieve rumor sources: reachability of infected nodes, distances to infected nodes in the graph, etc.

Some other works proposed to consider a framework in which we also observe when each node became infected. In this framework, however, some infections (including that of the source) remain unobserved (due for example to some API restrictions), and the goal is to retrieve the source node from the set of unobserved nodes \(\mathcal {U} \setminus \mathcal {U}'\). A first model was developed in [16], based on the assumption that transmission delays in the network follow a Gaussian distribution. The predicted source then corresponds to an unobserved node that maximizes the likelihood of the observed infection times. The authors proposed a heuristic based on the extraction of trees from the graph, similar to those described in [20, 22]. Recently, [5] proposed a more precise approach based on previous works on information diffusion and link prediction [7], where transmission delays follow an exponential distribution. Since the likelihood of a source is difficult to compute, a method based on importance sampling is employed.

Finally, the problem of multiple sources detection has also been addressed. In [11], the authors define the k-effectors problem. They assume that diffusion follows an Independent Cascades Model (IC) and look for the set of k sources X that minimizes the cost:

$$C(X) = \sum _{u_i\,\in \,\mathcal {U}} |a(i) - \alpha (i,X)|$$

where a(i) indicates whether \(u_i\) is infected or not (1 or 0) and \(\alpha (i,X)\) is the probability for user \(u_i\) to become infected when the source set is X. In other words, they minimize an \(\ell 1\) error. The minimization of C is shown to be NP-complete, so the authors study the problem on tree graphs, and propose a heuristic for this case. For general graphs, they suggest extracting a spanning tree and applying the heuristic to it.

Another approach, NetSleuth, was proposed in [17]. It relies on the Minimum Description Length (MDL) principle. The authors propose an efficient method to describe the diffusion of a piece of information (initial sources and list of all successive transmissions) in a minimum number of bits, assuming that the graph and the diffusion model are known. Given a set of infected users, they look for the set of sources and transmissions that minimizes the number of bits required for the encoding. This approach is thus able to determine the number of sources as well as their identities.

While reasonable, modeling an iterative diffusion process on a known graph faces the following two main limitations:

  • The performance of source detection strongly depends on the quality of the diffusion graph that is considered. However, information about the diffusion graph is often missing, incomplete or irrelevant. Various methods, such as those proposed in [10] or [7], can learn the graph from a training set of diffusion episodes, but their effectiveness greatly depends on the representativeness of the available training data.

  • The estimation of the most probable source \(s^*\) usually requires computing the shortest paths between all pairs of users in the graph, which is computationally expensive. For instance, the approach presented in [21] is \(\#P\)-hard.

Finally, these models have usually been tested on synthetic datasets only, with episodes generated using the very diffusion model used for prediction. While this definitely gives important insights into their behavior, results on real diffusion episodes are necessary to assess the effectiveness of these approaches. For instance, [5] performed experiments on the MemeTracker dataset, and the results are much lower than those obtained on synthetic datasets.

In this paper, we propose to embed users in a latent space and use distances between them to retrieve the source. This is related to the work of [2], which applied representation learning to the information propagation prediction task. Recently, representation learning has been used in various domains such as playlist prediction [3] or language modeling [15]. These methods aim at projecting items such as songs, users or words into a latent Euclidean space, so that relationships between them can be modeled by distances in that space. Representation Learning has at least the following main advantages in the context of source detection:

  • Compression abilities offered by representation learning techniques enable the definition of more compact models, especially for dense social networks;

  • Diffusion relationships, which are encoded in a shared representation space, are naturally regularized: users with similar behaviors are likely to be projected near each other, and thus tend to share similar transmission tendencies with other users, which improves the ability of the model to generalize from sparse data;

  • A representation for the diffusion episode can be computed efficiently by combining individual representations of the infected users. This enables simple and fast source detection procedures;

  • The diffused content or any other additional information can be taken into account, by considering specific transformations of the diffusion representation.

Rather than reversing a given diffusion model as classically done, we thus propose to consider the use of such techniques for source detection tasks, by directly learning projections of users that lead to an efficient retrieval of diffusion sources.

3 Diffusion Source Detection

Let \(\mathcal {U}=\lbrace u_1,\ldots ,u_{N}\rbrace \) be a population of N users who communicate and exchange information. When some piece of information propagates in that population, we observe a diffusion episode, which corresponds to a sequence of infected users associated with their timestamps of infection:

$$ D=\lbrace (u_i,t_i), (u_j,t_j), \ldots \rbrace $$

A diffusion episode can correspond, for instance, to a sequence of users who liked a specific video or retweeted a specific tweet. The first user of this sequence is the source user, denoted \(s_{D}\). In the following, we denote by \(\mathcal {U}_{D}\) the set of users infected in the episode D and by \(\hat{\mathcal {U}}_{D}\) the same set without the source user of D (i.e., \(\hat{\mathcal {U}}_{D}=\mathcal {U}_{D} \setminus \lbrace s_{D} \rbrace \)).
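For concreteness, a diffusion episode can be handled as a plain time-ordered list of (user, timestamp) pairs. The following minimal sketch (the names and structure are ours, not part of the original formulation) shows how \(s_D\), \(\mathcal {U}_D\) and \(\hat{\mathcal {U}}_D\) can be extracted from such a record:

```python
from typing import List, Tuple

Episode = List[Tuple[int, float]]  # [(user_id, infection_time), ...], sorted by time

def source_of(episode: Episode) -> int:
    """Return s_D, the first infected user of the episode."""
    return episode[0][0]

def infected_without_source(episode: Episode) -> set:
    """Return the set U^_D of infected users, excluding the source."""
    return {u for u, _ in episode[1:]}

# Example: user 3 starts the diffusion, users 7 and 1 are infected later.
D = [(3, 0.0), (7, 2.5), (1, 4.1)]
assert source_of(D) == 3
assert infected_without_source(D) == {7, 1}
```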

3.1 A Representation Learning Model

Our ultimate goal is to be able to retrieve the source \(s_D\) of a given diffusion episode D from which that source is missing (i.e. observed infected users are those belonging to \(\hat{\mathcal {U}}_{D}\)).

Basic Idea. To this end, our idea is to embed all users of the network in a latent space, by defining a representation \(z_i \in \mathbb {R}^d\) for every user \(u_i \in \mathcal {U}\), such that it is possible to predict the source user of an episode by looking at the relative locations of users in this space. With \(z_D\) a representation of the episode D, constructed as a function of individual representations of users in \(\hat{\mathcal {U}}_D\), we base our model on the following principle:

The representation of the source user \(s_{D}\) of any diffusion episode \(D \in \mathcal {D}\) should be located at the point \(z_{D}\) , which corresponds to the synthetic representation of the diffusion episode.

Following this principle, the episode representation \(z_D\) corresponds to an initial diffusion point from which the emitted content can spread to reach all infected users of the episode D. In this context, building a source prediction model amounts to making the representation of \(s_D\) coincide with this continuous initial diffusion point \(z_D\). Like most representation learning techniques, we seek to define a projection space where the similarities between users' representations reflect interaction propensities between them, which allows the model to leverage behavioral correlations between users. Accordingly, we wish to set the initial diffusion point \(z_D\) as close as possible to every individual representation of users in \(\hat{\mathcal {U}}_D\), so that it explains equally well the whole set of observed infections. Various functions \(\phi :2^{\mathcal {U}} \rightarrow \mathbb {R}^d\), transforming the set of individual representations of users in \(\hat{\mathcal {U}}_D\) into the episode representation \(z_D\), can satisfy this requirement. Another constraint is a low computational cost. We therefore consider the following function \(\phi \), which corresponds to an averaged representation of infected users:

$$\begin{aligned} z_D=\phi (\hat{\mathcal {U}}_D) = \frac{1}{|\hat{\mathcal {U}}_D|} \sum _{u_i\,\in \,\hat{\mathcal {U}}_D} z_i \end{aligned}$$
(1)

Note that such a function \(\phi \) also has the advantage of being rather stable w.r.t. missing infected users (i.e., when \(|\hat{\mathcal {U}}_D|\) is sufficiently large, \(\forall u \in \mathcal {U}: \phi (\hat{\mathcal {U}}_D \cup \{u\}) \approx \phi (\hat{\mathcal {U}}_D)\)), which allows the model to manipulate consistent representations in the case of incomplete observations of diffusion episodes. An illustration of the targeted projection of a given observed diffusion episode is given in Fig. 1, where the source user of the episode D is projected at the center of the representations of the infected users in \(\hat{\mathcal {U}}_D\).
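As an illustration, the averaging function \(\phi \) of Eq. (1) reduces to a one-line operation on a matrix of user embeddings. The sketch below assumes the receiver embeddings are stored as rows of a NumPy array; the variable names are illustrative only:

```python
import numpy as np

def episode_representation(Z: np.ndarray, infected: list) -> np.ndarray:
    """Eq. (1): z_D is the average of the receiver embeddings z_i of the
    infected users (source excluded). Z is the N x d matrix of embeddings."""
    return Z[infected].mean(axis=0)

# Example with hypothetical embeddings of dimension d = 4 for N = 100 users.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 4))
z_D = episode_representation(Z, [7, 1, 42])
```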

Source Prediction Model. Now that the basic idea of our proposal is presented, we can define our source prediction model based on it. To begin with, let us note that learning a single representation per user would lead to a symmetric model, where the tendency of a user to be an information source would be equal to its tendency to be infected in a diffusion episode. This setting is not realistic, as diffusion is an asymmetric process: while some users are opinion leaders, others only reproduce collected content. To account for this, we consider two representations for each user \(u_i\): the vector \(z_i\) embeds the behavior of \(u_i\) as a receiver of information, while \(\omega _i\) embeds its behavior as a sender of content (i.e., a source user in our context, since we only consider transmissions from the source to all eventually infected users). Defining these two embeddings per user allows us to model asymmetric relationships.

To retrieve the source of the diffusion D given \(\hat{\mathcal {U}}_D\), the model then considers the user \(u_i\) whose sender embedding \(\omega _i\) is the closest to the synthetic representation of the episode \(z_D\):

$$\begin{aligned} s^\star = \mathop {{{\mathrm{\arg \!\min }}}}\limits _{u_i\,\in \,\mathcal {U} \setminus \hat{\mathcal {U}}_D} || \omega _i - z_D||^2 \end{aligned}$$
(2)

where \(z_D\) is computed by applying formula 1 to the users in \(\hat{\mathcal {U}}_D\). In order to learn both sets of embeddings \(\Omega =(\omega _i)_{u_i\,\in \,\mathcal {U}}\) and \(\mathcal {Z}=(z_i)_{u_i\,\in \,\mathcal {U}}\) so that formula 2 returns accurate diffusion sources, we consider the following pairwise loss on the learning set of diffusion episodes \(\mathcal {D}\):

$$\begin{aligned} \mathcal {L}(\Omega ,\mathcal {Z})= \sum _{D\,\in \,\mathcal {D}} \sum _{u_i\,\notin \,\mathcal {U}_D} h \left( ||\omega _{i} - z_D||^2 - ||\omega _{s_D} - z_D||^2 \right) \end{aligned}$$
(3)

where h corresponds to the hinge loss function: \(h(x)=\max (1-x,0)\). This function is a pairwise ranking loss that follows the principle of the prediction function (formula 2). Basically, for our prediction to be valid, we need the sender representation of the actual source to be closer to the representation of D (second term of the subtraction in h) than any other sender representation (first term of the subtraction), so that it is the one that would be predicted using formula 2.
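The prediction rule of formula 2 and one term of the loss of formula 3 can be sketched as follows (a simplified illustration under our own naming conventions, not the authors' reference implementation):

```python
import numpy as np

def predict_source(Omega: np.ndarray, z_D: np.ndarray, observed: set) -> int:
    """Formula 2: the predicted source is the user whose sender embedding
    omega_i is closest to z_D, among users not observed in the episode."""
    dists = np.sum((Omega - z_D) ** 2, axis=1)   # squared distances to z_D
    dists[list(observed)] = np.inf               # candidates are U \ U^_D
    return int(np.argmin(dists))

def pairwise_hinge(Omega: np.ndarray, z_D: np.ndarray, source: int, other: int) -> float:
    """One term of formula 3, with h(x) = max(1 - x, 0)."""
    x = np.sum((Omega[other] - z_D) ** 2) - np.sum((Omega[source] - z_D) ** 2)
    return max(1.0 - x, 0.0)
```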

This loss can easily be minimized with a stochastic gradient descent procedure, detailed in Algorithm 1. Intuitively, it can be summarized as follows: first, we initialize all embeddings at random (lines 2 and 3). Then, at each iteration, we draw one episode D (line 6) and one “non-source” user \(u_j\) that is not in \(\mathcal {U}_D\) (line 7). If the embedding \(\omega _{s_D}\) of the actual source is not closer to the representation \(z_D\) than \(\omega _j\) by at least 1 (line 11), all relevant embeddings are updated with one gradient step (lines 12, 13 and 15). This gradient step moves the representation \(z_D\) toward \(\omega _{s_D}\) and away from \(\omega _j\). The learning goes on until convergence, which is tested by checking the variation of \(\mathcal {L}\) every fixed number of iterations (100,000 in our case).

Fig. 1. From a diffusion episode tree to its projection in our representation space.

Algorithm 1.
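The stochastic procedure described above can be sketched as follows. The learning rate, iteration budget and initialization scale are illustrative assumptions, and the convergence test on \(\mathcal {L}\) as well as the regularization term introduced below are omitted for brevity:

```python
import numpy as np

def train(episodes, N, d=30, lr=0.01, n_iter=500_000, seed=0):
    """Sketch of the stochastic learning loop (Algorithm 1): episodes is a list
    of time-ordered [(user, time), ...] lists whose first entry is the source."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(scale=0.1, size=(N, d))       # receiver embeddings z_i
    Omega = rng.normal(scale=0.1, size=(N, d))   # sender embeddings omega_i

    for _ in range(n_iter):
        ep = episodes[rng.integers(len(episodes))]         # draw an episode D
        s, infected = ep[0][0], [u for u, _ in ep[1:]]     # source s_D and U^_D
        outside = np.setdiff1d(np.arange(N), [u for u, _ in ep])
        j = int(rng.choice(outside))                       # "non-source" user u_j

        z_D = Z[infected].mean(axis=0)
        margin = np.sum((Omega[j] - z_D) ** 2) - np.sum((Omega[s] - z_D) ** 2)
        if margin < 1.0:                                   # hinge loss is active
            g_s = 2 * (Omega[s] - z_D)
            g_j = 2 * (Omega[j] - z_D)
            Omega[s] -= lr * g_s                           # pull the source toward z_D
            Omega[j] += lr * g_j                           # push u_j away from z_D
            # move z_D (i.e., each z_i of infected users) toward omega_{s_D}, away from omega_j
            Z[infected] += lr * (g_s - g_j) / len(infected)

    return Omega, Z
```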

Regularization of Embeddings. In the loss defined above, two representation vectors are learned for each user to account for the difference between its behavior as a sender and as a receiver [1]. While these two representations can be quite different, it is reasonable to think that they are not uncorrelated: both behaviors are consequences of the centers of interest of that user. To account for this correlation, we include a sender-receiver regularization term in the loss considered for the learning of the model:

$$\begin{aligned} \mathcal {L}(\Omega ,\mathcal {Z}) + \lambda \sum _{u_i}||\omega _i - z_i||^2 \end{aligned}$$
(4)

where the second term corresponds to the desired regularization, weighted by a hyper-parameter \(\lambda \). This term favors embeddings such that \(\omega _i\) and \(z_i\) are close, and improves the generalization ability of the model. For instance, without this term, no embedding \(\omega _i\) could be learned for a user who never appears as a source in \(\mathcal {D}\). With the regularization term linking the two representations \(\omega _i\) and \(z_i\), some information about \(z_i\) can be transferred to \(\omega _i\). This also prevents over-fitting.

3.2 Extensions

Inclusion of User Importance. One possible extension of our model is to learn an additional weight \(\alpha _i \in \mathbb {R}^+\) for each user in the training set and redefine \(z_D\) as:

$$ z_D = \sum _{u_i\,\in \,\hat{\mathcal {U}}_D} \dfrac{e^{\alpha _i} }{\sum _{u_j\,\in \,\hat{\mathcal {U}}_D}e^{\alpha _j}} z_i $$

where the fraction corresponds to the softmax function, which maps a vector of k real values to \([0;1]^k\). This formulation corresponds to the computation of the barycenter of the representations of users in \(\hat{\mathcal {U}}_D\), with weights defined by the relative values of the \(\alpha \) parameters of these users. These parameters therefore model the relative importance of each user for predicting the source of the diffusion. For instance, on Twitter, some user \(u_i\) may actually be a spamming bot that simply reuses all popular hashtags in order to gain visibility and post ads. In that case, the infection of this user gives little to no information about the source, and the system will learn a weight \(\alpha _i \approx 0\). Beyond allowing the learning process to focus on more discriminant infections, and to discard users with very chaotic behaviors, this may also help select the most important users to monitor in situations where only a subset of them can be monitored simultaneously [16].
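A sketch of this weighted barycenter, under the same illustrative conventions as before (the max-shift is only for numerical stability and does not change the softmax):

```python
import numpy as np

def weighted_episode_representation(Z: np.ndarray, alpha: np.ndarray,
                                    infected: list) -> np.ndarray:
    """User-importance extension: z_D is a softmax-weighted barycenter of the
    receiver embeddings of infected users, with learned weights alpha_i."""
    a = alpha[infected]
    w = np.exp(a - a.max())          # stable softmax over the infected users
    w /= w.sum()
    return w @ Z[infected]
```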

Integration of Content. It is known that the content of a piece of information modifies the way it propagates [23]. For instance, two pieces of information shared by the same source, one about sports and the other about politics, will probably not spread to the same users. In this subsection, we propose a way to include content in the model. The content associated with an episode D is represented by a vector \(w_D \in \mathbb {R}^a\). Depending on the application, this vector may for instance be a bag-of-words extracted from text, or visual features extracted from an image. We learn content transformation parameters \(\theta \in \mathbb {R}^{a\times d}\) that are used to map a given content to \(\mathbb {R}^d\) by a linear map \(<w_D,\theta>\). The resulting vector is used to translate the episode representation \(z_D\), which implies content-specific modifications of the prediction model:

$$\begin{aligned} z_D=\frac{1}{|\hat{\mathcal {U}}_D|} \sum _{u_i\,\in \,\hat{\mathcal {U}}_D} z_i + <w_D,\theta > \end{aligned}$$
(5)

The parameters \(\theta \) are learned at the same time as the users' projection parameters, by optimizing the loss from formula 4 with this definition of the translated representation \(z_D\). Note that other content-specific transformations have been investigated, but this simple translation of \(z_D\) led to the best results on a validation set.
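Under the same conventions as the previous sketches, the translated episode representation of Eq. (5) can be written as:

```python
import numpy as np

def content_episode_representation(Z: np.ndarray, theta: np.ndarray,
                                   w_D: np.ndarray, infected: list) -> np.ndarray:
    """Eq. (5): translate the averaged episode representation by a linear map
    of the content vector w_D (theta has shape a x d, w_D has shape a)."""
    return Z[infected].mean(axis=0) + w_D @ theta
```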

4 Experiments

4.1 Datasets

The following datasets have been used:

  • Artificial: Diffusion episodes generated using the IC model [18] on a scale-free network of 100 users.

  • Lastfm: Dataset extracted from a music streaming website. Each diffusion episode gathers the users who listened to a given song.

  • Weibo: Retweet cascades extracted from the Weibo microblogging website using the procedure described in [12]. The dataset was collected by [6].

  • Twitter: Diffusion episodes of hashtags on Twitter, over a fixed population of about 5000 users during the 2012 US presidential campaign.

Each dataset was filtered to keep only a subset of about 5000 of its most active users. Table 1 gives some statistics on the datasets.

Table 1. Some statistics on the datasets: the number of users \(|\mathcal {U}|\), of links \(|\mathcal {E}|\) in the graph, of diffusion episodes in the training set, and the density of the graph.

4.2 Baselines

We compare our model to several graph-based baselines:

  • OutDeg: This simple baseline was used in [5]. First, we find all the “possible sources”, i.e. all users who can reach every infected one through a series of hops in the graph. Then, we rank these possible sources by their out-degree, the user with the highest out-degree being predicted as the source.

  • Jordan Center: The use of a Jordan center as a source estimator was studied in [14]. Because our experimental context is not exactly the same as in [14], we slightly adapt its formulation: the predicted source is the user whose longest distance to any infected user is minimal.

  • Pinto's: The model described in [16], based on the assumption that infection delays follow a Gaussian law. It uses a heuristic based on the extraction of a tree subgraph.

For all these approaches, the diffusion graph used is obtained by using the Expectation-Maximization procedure described in [10] to learn the parameters of an Independent Cascades Model. This returns a probability of transmission \(p_{i,j}\) for each pair of users. We then assume that a link \((u_i,u_j)\) exists in E if and only if the learned probability of transmission \(p_{i,j}\) is greater than S, where S is a threshold set empirically for each baseline to maximize its results on a validation set.

4.3 Experimental Contexts and Results

We now present the results obtained by all models on several experiments. We evaluate the ability of the models to retrieve the source on a testing set of diffusion episodes \(\mathcal {D}^\prime \) with a Top-K measure, for various values of K. The Top-K measure is computed by sorting users according to their “scores” (i.e. likelihood or distance to \(z_D\), depending on the model). If the actual source is among the K best-ranked users, the Top-K value is 1, otherwise it is 0.
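For our model, the Top-K measure on one test episode can be sketched as follows (variable names are ours; averaging the returned values over \(\mathcal {D}^\prime \) gives the reported Top-K precision):

```python
import numpy as np

def top_k_success(Omega: np.ndarray, Z: np.ndarray, episode, K: int) -> int:
    """Return 1 if the true source is among the K users whose sender embeddings
    are closest to z_D, 0 otherwise (the "score" used by our model)."""
    s, infected = episode[0][0], [u for u, _ in episode[1:]]
    z_D = Z[infected].mean(axis=0)
    dists = np.sum((Omega - z_D) ** 2, axis=1)
    dists[infected] = np.inf                 # the source is searched outside U^_D
    return int(s in np.argsort(dists)[:K])
```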

Fig. 2. Time to convergence and source detection performance (Top-5 measure) for various values of d, on the Weibo dataset.

Choice of Latent Space Dimension. As a preliminary experiment, we study the effect of the number of dimensions d of the latent space. Figure 2 shows the time taken by our learning algorithm to converge for various values of d, and the performance obtained on the source detection task (described in the next subsection) on a validation set. Results are shown for the Weibo dataset, but we observed similar trends on the other real datasets. We can see that while the training time grows linearly with d, performance only improves marginally for values of d beyond about 30. For these reasons, we use \(d=30\) in all of our experiments.

Fig. 3. Source detection with full cascades. Top-K precision.

Source Detection with Full Cascades. This is the regular experimental context: find \(s_D\) given \(\hat{\mathcal {U}}_D\). Results for our model (RL) are given for a regularization parameter \(\lambda =10^{-4}\) that appeared to lead to the best results on a validation set. Results are presented in Fig. 3.

Firstly, we can see that on the artificial dataset, our model and the Jordan Center model obtain better results than the other two baselines. Let us remember that on this dataset, diffusion episodes are generated using an Independent Cascades model (IC). Since the dataset is very small and the data have been generated following this diffusion model, IC easily retrieves the actual transmission channels between users from the training set of diffusion episodes. In this context, the Jordan Center heuristic, which is based on an exhaustive computation of the number of hops between nodes, can achieve good performance, which our model is able to match without the use of an external diffusion model. Pinto's model, on the other hand, performs its calculation on a tree extracted from the graph with a Breadth-First Search, which ends up ignoring a lot of information and reduces its performance.

On the Weibo dataset, the IC algorithm cannot retrieve the real diffusion graph (too few data w.r.t. the complexity of the network, and the IC hypotheses are not fully verified). Therefore, the performances of Pinto's model and of the Jordan Center model are closer. Meanwhile, our approach outperforms every baseline, because it does not rely on any hypothesis about the graph structure or the diffusion model. The fact that Pinto's model ends up slightly below the Jordan Center can be explained by its assumption that transmission delays follow a Gaussian distribution, which is unrealistic in real datasets [5].

Finally, Lastfm and Twitter are noisier datasets: the fact that two users listened to the same song or used the same hashtag does not always mean that one of them infected the other; they might simply have similar centers of interest. In this context, infections may not be linked by causation but by correlation, which in turn limits the relevance of the extracted graph. Since all the baselines are based on that graph, they exhibit poor performance on these datasets, while our model outperforms them.

While the results of all approaches may appear rather low on Twitter, they can still be useful in contexts like the one described in [13]: when the administrator of a network looking for the source of a rumor needs to decide which users to probe, any model that gives a non-trivial result (which is the case here) is valuable.

Fig. 4. Source detection on partial cascades (20%). Top-K precision.

Source Detection on Partially Observed Cascades. As said in the introduction, diffusion episodes are often only partially observed in real-world applications. We simulate this by removing random users from the diffusion episodes of the testing set. For each diffusion episode D in \(\mathcal {D}^\prime \), we only keep a fixed percentage of \(\hat{\mathcal {U}} _D\) (20% in this experiment). Results are presented in Fig. 4.

On the Artificial dataset, all models see a large drop in performance. Our approach ends up below the Jordan Center heuristic, and on par with Pinto's, which remains roughly at the same level. Here, the superiority of the Jordan Center method can be explained by the fact that its shortest-path computation perfectly represents how information transits in an IC model. Also, because we use a scale-free random graph, a small number of observed users is enough to narrow down the set of possible sources in the graph.

On the other hand, on the Weibo dataset, most models remain stable, and our approach stays superior. Interestingly, results on Lastfm and Twitter differ. On the Lastfm dataset, OutDeg ends up being better than the other baselines, while the Jordan Center approach beats the other baselines on the Twitter dataset. On Lastfm, long chains of diffusion are rather rare, as most influences occur from one central user to a set of successors in the graph. This makes the out-degree of a user a good indicator of its tendency to be an “early adopter” who influences the whole set of infected users. On Twitter, longer chains of diffusion are observed, which results in better results for the Jordan Center heuristic, since it comes down to finding the source that minimizes the number of hops required in the retweet graph to reach all infected users. Like on the Weibo dataset, the performance of Pinto's method is poor because it heavily relies on the modeling of transmission delays, which are very chaotic and hard to capture on such a noisy dataset [10]. In the end, in both cases, our approach exhibits better performance.

Overall, we can see that graph models can yield very different results depending on the dataset considered: the best one on a given dataset can be the worst one on another. Meanwhile, our approach achieves consistent and better results on the real datasets, thanks to the use of a latent space that makes it more robust to noise and sparsity.

Fig. 5. Source detection with partially observed cascades (20%) during learning and testing.

Learning on Partially Observed Cascades. In the previous experiments, we assumed that we had access to complete diffusion episodes during the learning step. However, it may not be the case in real applications. The same reasons that can make one unable to observe full episodes during the inference step may also prevent us from collecting full episodes for learning. To study this case, we used the filtering procedure described in the previous experiment, and kept only 20 % of the infections contained in each diffusion episode from \(\mathcal {D}\) and \(\mathcal {D}^\prime \). Results are given in Fig. 5.

On most datasets, the relative performances of the models are similar to those obtained in the previous experiment, which is not surprising since the testing sets are the same. However, on the artificial dataset, our model clearly outperforms the Jordan baseline, contrary to the previous experiments. This is due to the fact that the influence graph inferred from \(\mathcal {D}\) is far from perfect, since the diffusion episodes in \(\mathcal {D}\) are partial. This greatly reduces the quality of the Jordan model. Overall, our model also outperforms the baselines in this setting.

Complexity. For the learning part, both our model and the graph-based approaches use a stochastic algorithm that takes roughly the same time to converge. However, our approach learns a fixed number of parameters per user, which grows linearly with the number N of users, while the number of parameters of graph models is the number of links in the graph, which can scale quadratically with N. Furthermore, during inference, our model is much faster: it usually takes less than a second to perform source detection for one episode, while the baselines require minutes. We only need to compute the episode representation \(z_D\) and its distance to all possible sources, which has a linear complexity. The graph models usually need to compute the shortest distances between all users in the graph, which is much more expensive. This makes our approach more scalable, which is an important issue when dealing with large online social networks.

Inclusion of User Importance. In this subsection, we test the extension defined in Sect. 3.2. We compare the results of the model with weights to those of the base version, on the real datasets. Results are presented in Table 2. We can see that on the Twitter dataset, using weights improves our results by about 10 %. This is due to the fact that Twitter is a widely used social network and thus a very noisy dataset. Learning relative importance weights for users enables our model to limit the impact of users whose behavior disrupts the prediction process. We observe a similar effect on the Lastfm dataset. On the Weibo dataset, however, results are not better when learning such weights, which might indicate that users are more homogeneous on this dataset. This can be verified by looking at the variance of the \(\alpha \) values: we measure a variance of 0.12 and 0.15 on the Twitter and Lastfm datasets, respectively, while this variance is only 0.08 on Weibo. These results may lead to interesting possibilities for the task of monitor selection, i.e., finding the M best users to monitor in order to achieve the best possible source detection [19]. In our case, this would come down to selecting the M users with the highest weights \(\alpha \).

Table 2. Source detection with user weights. Models are only tested on diffusion episodes of length 3 or more (both models are equivalent on diffusion episodes of length 2).

Integration of Content. We now test the content-aware model extension described in Sect. 3.2. This version of the model was only tested on the Twitter dataset. We extracted the content of all episodes by using a bag-of-words representation of the tweets they contain. The dictionary is filtered to keep only the 2000 most frequent words. Each word is associated with an integer between 0 and 1999, and the representation of the content is a vector in \(\mathbb {N}^{2000}\) indicating the number of occurrences of each word in the tweets. The data collection was limited to tweets written in English, but the approach would remain valid for other languages. Results are presented in Table 3. We can see that the integration of the content greatly improves our predictions, especially for Top-1.
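A minimal sketch of this content representation, with a toy vocabulary standing in for the 2000-word dictionary (the whitespace tokenization shown here is an assumption, as the paper does not detail preprocessing):

```python
from collections import Counter

def bag_of_words(episode_tweets, vocabulary):
    """Build the content vector w_D: counts of dictionary words over all tweets
    of the episode. vocabulary maps each kept word to an index."""
    counts = Counter(w for tweet in episode_tweets
                     for w in tweet.lower().split() if w in vocabulary)
    w_D = [0] * len(vocabulary)
    for word, c in counts.items():
        w_D[vocabulary[word]] = c
    return w_D

# Example with a tiny hypothetical vocabulary.
vocab = {"iran": 0, "election": 1, "occupy": 2}
print(bag_of_words(["Iran election news", "occupy wall street"], vocab))  # [1, 1, 1]
```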

Table 3. Source detection with content integration. Tested on the Twitter dataset

Since we use a bag-of-words representation of size 2000 and a linear projection of that content into a d-dimensional space, we learn a \(2000\times d\) projection matrix (the parameters \(\theta \)) whose rows can be interpreted as representations of the words in \(\mathbb {R}^d\). Table 4 lists the ten words with the largest representation norms. We can see that, apart from “new” and “retweet”, all listed words are meaningful ones that are strongly informative about the topic of the diffusion.

Table 4. Top 10 most important words according to our content-based model.
Table 5. Pairs of words with the highest cosine similarity between their representations.

Furthermore, words with similar representations should tend to have similar effects on the diffusion. To verify this, we show in Table 5 the pairs of words with the highest cosine similarities. We can see that these pairs indeed either correspond to words with similar meanings (leisur/getawai and iran/iranian) or words used in similar contexts. OpESR (Operation Empire State Rebellion) and OccupyHQ are pages used by activists from the “Occupy Wall Street” movement. “Masen” and “Mapoli” stand for “Massachusetts Senate” and “Massachusetts Politics”.
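The analysis behind Tables 4 and 5 can be reproduced from the learned matrix \(\theta \) with a few lines; the sketch below (our own, with an assumed index-to-word mapping) ranks words by the norm of their row and lists the most cosine-similar pairs:

```python
import numpy as np
from itertools import combinations

def word_analysis(theta: np.ndarray, index_to_word: list, n_words=10, n_pairs=5):
    """Rank words by the norm of their row in theta (Table 4) and find the pairs
    of word representations with the highest cosine similarity (Table 5)."""
    norms = np.linalg.norm(theta, axis=1)
    top_words = [index_to_word[i] for i in np.argsort(-norms)[:n_words]]

    unit = theta / np.maximum(norms[:, None], 1e-12)   # normalized rows
    sims = {(i, j): float(unit[i] @ unit[j])
            for i, j in combinations(range(len(theta)), 2)}
    best = sorted(sims, key=sims.get, reverse=True)[:n_pairs]
    top_pairs = [(index_to_word[i], index_to_word[j]) for i, j in best]
    return top_words, top_pairs
```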

5 Conclusion

In this paper, we proposed a novel and efficient method to retrieve the sources of information from diffusion episodes. This method is based on the use of a latent space that embeds the influences and similarities between users. We tested this approach on artificial and real diffusion episodes, and found that it achieves better performance than state-of-the-art approaches, while retaining a lower complexity. We also proposed ways to learn the importance of each user and to integrate the content of the information in the model, and showed that both lead to performance improvements. Ongoing work focuses on a unifying framework of cascade completion: how can we retrieve a whole diffusion episode given only a fraction of its users? Source detection and diffusion prediction are special cases of this task. A unifying model, fitted for this more general problem, would give important insights into the nature and dynamics of diffusion.