Keywords

1 Introduction

At present, social networks are widely spread. This type of communication allows you to quickly disseminate information through personal messages, posts on the wall in the profile and group, etc. For example, the popular Russian social network VK has an audience of more than 97 million people every month, users send about 5 billion messages every day [1]. The structure of cyberspace like a social network can be represented by a complex network, whose nodes are entities represented as users and/or communities. The relationship between individual entities reflects the interests of a user and can be represented as permanent and temporal links of type “user-user” or “user-community”.

Existing models of information spread in networks are oriented to the reproduction and forecasting of aggregated dynamics of reactions and patterns of spread (e.g. cascade structures) and do not take into account the information-psychological influence on specific users. Nevertheless, the reaction of each user is unique and is determined by the conditions of access to information, the content of the information message, the internal state of the user, the history of his or her interaction with the source of information, and so on.

On the other hand, a completely individualized approach is impossible because of the rarity of the observed reactions (social networks contain a large amount of information about the population, but a small amount for specific users), and the impossibility to distinguish and to identify all factors that determine user reaction.

The solution in this situation can be the use of surrogate models that can be identified by data for different types of social network entities, different roles relative to the source of information, different profile parameters, different levels of involvement, etc. The combination of these models allows to provide a sufficient level of personalization and adjustment for the specific context of the social network where the process is modelled (in this study, the context is a community dedicated to charity) and to reproduce the aggregated dynamics in a proper way.

In this paper, we present a general approach to creating and combining different types of data-driven models (message generation, daily activity, response to messages, message transmission) and demonstrate their applicability by simulating reaction dynamics for a large charity community in a social network. To make our framework suitable for handling social networks of enormous size (e.g. a single video blogger may have millions of subscribers), we also provide a parallel implementation of the simulation algorithm.

The rest of the paper is organized as follows. Some background information and related works are given in Sect. 2. Section 3 presents simulation framework, description of different models and details of parallel implementation. In Sect. 4, we describe the dataset consisting information about charity community in vk.com, discuss the identification of model parameters and demonstrate the results of the simulation. Section 5 presents conclusions and a discussion of further research.

2 Related Works

Models of information spreading in cyberspace can be classified according to several distinctive features: (i) way of describing the process of information spread (network, multi-agent, parametric models), (ii) way of presenting an entity (e.g. binary opinion, states in SIR model, profile in a social network), (iii) way of presenting an information message (IM), (iv) the goal – explanatory or predictive model, (v) observables – parameters of cascade, number of reactions of different types or distribution of nodes in different states. The extensive review of the research methods and techniques on information diffusion can be found in [2].

Models of opinion dynamics are aimed to study the evolution of groups of nodes with different inner states (representing the opinion of a given topic) in a networked virtual society. For example, [3] describes the emergence of consensus and polarization as a result of evolution of opinions, [4] describes polarization of coalitions in an agent-based model of political discourse. Typically, these models use synthetic networks (Barabasi-Albert, Watts-Strogatz) to define connections between nodes. A state of a node is updated with simple rules accounting states of neighbors (e.g. majority rule [5]). Recent studies in this field investigate coevolution of opinions and network structure. The example of such approach is given in [6] where agents with a similar opinion may create community and may tend to remove connection with an agent having a significantly different opinion.

A vast amount of works is devoted to the study of the dynamics of reposts and their cascades in social networks. In [7], authors propose RepostsTree model to represent temporal dynamics of reposts in a hierarchical fashion for microblogging service Sina Weibo. Another article [8] is devoted to the Random Recursive Tree (RRT) for modeling the cascade tree topologies. Authors studied key features of a cascade tree like the average path length and degree variance in relation to a size of the tree.

Another strand of literature uses multi-agent approach where explicit rules for processing information by agents and interaction of agents are given. As an agent, different entities of cyberspace may be considered. In the first approach, messages appear to be the agents [9]. Each agent is characterized by an energy that changes during the life cycle, the energy increases in case of the repost of the message and decreases every iteration until it becomes equal to 0. In the second approach, users are selected as the agents. Authors of [10] propose Hashkat – agent-based software package for simulation of large-scale social networks. Each agent has some internal characteristics like language, region, ideology, creation time and following and follower set. The Hashkat’s aim is to study the growth of social networks and information flows in them. However, it does not have a performance sufficient for real-time predictive modeling on large real-world networks. In [11], authors combine multi-agent and network-based approaches and identify internal models by real data thus implementing emerging complex agent networks approach. The authors of [12] consider a bottom-up approach for what-if analysis, where agents (micro level) define the emergent behavior in a network (macro level). They used an egocentric network and various daily activities for the reproduction of post’s publications in microblogging-based Online Social Network.

Continuing the review of agent models, it is possible to single out other studies that differ in the features of the representation of agents and their characteristics. For example, some of the psychological characteristics of individuals, such as the curiosity, may improve the diffusion of information in traditional spreading models [13]. Thus, authors try to relate the characteristics of the dissemination of information to the personal properties of the agent. Sayin [14] singled out a list of the main points that must be taken into account for a realistic presentation of the information dissemination process like popularity of the source, strength of relations among users, content of the information and so on. She also highlights the presence of different internal user states like ignorant, aware or other. However, in this model there is no connection with the data of real social networks. The model presented in [15] takes into account heterogeneity of stateful agents. However, this research was conducted only on artificial networks, such as the small-world or the scale-free networks. In addition, the conductivity of links between users may differ thus influencing information spreading [16]. The idea is that users who have more common friends will be identified in the same group and have a greater weight for the dissemination of information. There are examples of studies where the characteristics of the agent are diverse, nevertheless the majority of existing algorithms use fixed probabilities of information transfer (i.e. agents are interchangeable).

From the review of the existing literature in the field, we may conclude that classical models of opinion dynamics do not use realistic network structures and properties of nodes while most of the cascade models use the assumption of homogeneity of nodes. Some studies that implement an agent approach investigate the issue of the interrelation between personal properties of an agent and characteristics of information process. However, there is clear lack of research combining: (i) a network approach with realistic heterogeneity of nodes and edges, (ii) support for different drivers and restrictions of information spread (message characteristics, internal state and user parameters, history of interaction with information sources, daily rhythm and so on) by combining models for different types of social network entities, (iii) identification of models from the data of large-scale social networks.

In this article, we propose data-driven approach for parallel modeling of information spread in social networks based on the composition of models. The model is identified using the data collected by Internet crawling; parallel version of modeling framework can handle millions of agents.

3 Method

This section contains the description of the approaches used in the study. In the first part, we describe main components of the model. The second part is dedicated to description of the parallel algorithm.

3.1 Description of Models

Proposed model of information spreading is discrete; each iteration corresponds to a daily time interval. We include in the model three main entities: IMs, communities and users. The last two are the vertices in the network. IM is a unit of information which can be transferred between vertices. The edges in the network are represented by the following links: a subscription (community – user), a friend (user – user). IM represents the post and may have such characteristics as topic, publication time, virality coefficient, number of likes, reposts, comments. Relationships between the main components of the model is presented in Fig. 1.

Fig. 1.
figure 1

Structural diagram of the model (Color figure online)

In the model, there are three main internal models characterizing the processes of information dissemination and defining the behavior of entities: model of IM’s generation, model of activity and model of reaction. Model of IM’s generation is responsible for the appearance of new messages, it determines the time of creation of messages and their content. The message generator can be either a society (external environment) or an individual user. Model of activity sets the status of each agent: e.g. active or inactive. If the result of the agent activity model is a transition to active status, an agent reads messages in a newsfeed (a set of awaiting messages from communities or users, provided with information on the current number of responses). This operation can be implemented based on requests from agents to communities, and it is possible to translate new information into agent pools. After that, the model of the reaction to IM determines the result of the user’s interaction with the messages available: inaction, approval (like), the generation of a text message (comment), participation in the dissemination of information (repost). If the user reposts the IM, it will appear in the newsfeed of his subscribers.

Implementations of internal models can be tailored to the peculiarities of considered social network and information process to be modelled. Thus, proposed framework can be used to implement various existing models. For example, independent cascade model [17] can be represented as a combination of: (i) simplified activity model (users always have an active status), (ii) user profile with three parameters: current state (involved, not involved) and two flags (isSeedNode, isNewNode), (iii) generation model which is turned on for seed users, (iv) reaction model which uses fixed probability p of message transmission. To make a model more specific, it is possible to combine a generation model based on an impulse random process, the reaction model based on decision theory [18], an activity model based on hidden Markov models [19] and so on.

Parameters of different models from Fig. 1 are tuned according to the data collected from a social network (types of data are represented in Fig. 1 by green blocks). Public data contains a large amount of information about behavior of users and communities: basic profiles, distribution of count/parameters of posts by hours/weeks, parameters of day/week user activity. In general, the relationship of users and the community forms a network where the edges represent a friendship and subscription relationships.

3.2 Parallel Implementation

To be able to finish simulation of the information spread for large social networks (up to several millions of nodes) in a reasonable time, it is necessary to use a parallel calculation scheme. It can be of great importance for the scenarios when one is trying to predict the coverage of a given information message using preliminary tuned models and real-time data assimilation.

Algorithm 1.

Parallel simulation scheme for the information spreading process (for a worker w)

figure a

Algorithm 1 presents a scheme of the simulation of the information spreading process in a social network (for a single worker with index w). The simulation is discrete and operates according to models described in Sect. 3.1: generative model Gm, activity model Am and reaction model Rm. This scheme was then implemented on a C++ language with MPI standard for message interchange.

This algorithm is an extension of the previously published Master-Slave algorithm from [20]. In this setting, Master nodes forward data between subnetworks and generate news, Slave (worker processes) store a subnetwork and perform local computations. The function dest(v) returns the index of worker hosting vertex v.

The model assumes that each information message (publication) in the social network stores three vectors that describe the state of each user with respect to this publication. More specifically, the publication stores:

  • viewers – users who have already seen this publication;

  • potential viewers – users who have not yet seen this publication, but have it included in their news feeds;

  • spreaders – users who decided to share this publication to all their subscribers (i.e. post it on their own personal page).

For the sake of memory optimization, these data are stored as bitsets, which represent a subset of the social network hosted on the current Slave. This scheme allows to reduce the overhead associated with storing and synchronizing information about IMs.

On each iteration, Slave process receives the list of generated news from the Master node (step 2). The main cycle loops through the common news feed of this Slave, starting from the most recent news to the very first publication that was generated in the system. For each publication, each user in the list of potential viewers of this publication is examined. If the user is not active, the cycle proceeds to the next user (steps 5–6). If this user is active, the system models his interaction with the publication according to the Rm. Rm determines how the user p at the moment t reacts to the publication n (steps 7–12). Depending on the type of reaction, statistics for the publication on this Slave is updated respectively. If the user wants to repost this publication, he is added to the list of spreaders of this publication. After simulation of user reaction, the user is moved from the list of potential viewers to the list of viewers of this publication (steps 13–14). Now the system has to process the list of spreaders. The publication will be offered to each subscriber of each of the spreaders (steps 15–21). These suggestions for subscribers from other subnetworks should be sent to corresponding workers. We aggregate suggestions to be sent to other workers in a set of pools send_pools (steps 17–18) and send them after spreaders are processed (step 24). Subscribers from the current subnetwork are added to the list of potential viewers if they have never seen this publication before (steps 20–21). Since after this processing the publication is offered to all subscribers, the list can now be cleaned up (step 22). When all local processing is done, Slave node receives all messages from other Slaves (step 25). Users from these messages are added to the lists of the potential viewers of respective news if they have never seen these publications before (steps 26–27). After data synchronization, the clock ticks and the cycle starts over from the step 2.

4 Experimental Study

4.1 Dataset Description

The research of information dissemination processes among users of social networks was held on the example of users of vk.com. The network was formed on the basis of a public community dedicated to charityFootnote 1, its subscribers and subscribers’ friends. To conduct the study, a full-size sample of the network consisting of one community, 294,345 subscribers and 33,478,369 subscriber friends was obtained via Internet crawling. The network contains 80,629,758 edges, which form links between 33,768,037 nodes of the network.

The dataset includes information about 805 IMs, including such characteristics as creation time, number of likes, reposts and comments, times of reposting and commenting and the user’s id for each activity.

According to the distribution of the characteristics of IMs, depending on the types of users (Fig. 2), it can be concluded that most of the responses characterizing the distribution and discussion of messages belong to the followers and their friends.

Fig. 2.
figure 2

Distribution of reaction to IMs

Figure 3 shows the distribution of activity during the day for each type of weekdays and weekends. To determine the parameters of the model, data on the activity of users on public walls (reposts, comments) were used, to calculate statistics, the data of users who committed 5 or more activities during the considered interval were used.

Fig. 3.
figure 3

Dynamics of daily activity for different groups of subscribers of the community, “wd” and “we” in legend denote weekdays and weekends, respectively

Based on data of the user’s reactions number, several groups of users with different probability of activity and distribution of users of these groups was obtained.

The collected data allow to simulate reactions on posts in the community on the subscribers’ network and their friends taking into account the characteristics of the aggregated agent profile, the daily and weekly rhythm of use of online contexts by agents, and the generation of new information messages by the community.

4.2 Simulation Scenario and Implementation of Internal Models

To investigate our approach, we apply our scheme to study the response of users (described in Sect. 4.1) to community’s posts taking into account the characteristics of the users and the community. Below is a description of the three internal models that were used in our scheme. In Sect. 3.1, it was described that the presented scheme allows to change these models according to the considered problem.

Community behavior in the network is presented in the IM generation model and is characterized by the frequency of generation of messages on the wall, depending on the time of day and day of the week. To restore the dynamics of message generation, the distribution of the number of publications on the community wall can be used, depending on the day of the week and the time of day. Each message has a parameter of virality, which is determined by the distribution of responses to messages in the dataset. This parameter is characterized by the Gamma distribution.

The activity model determines a current context of a user within a given time frame. At each moment a user has one of the following states: active (online), inactive (offline). The state is affected by type of daily activity, weekday/weekend schedule for types of daily activity and the parameter for determining the frequency to be online. The status is determined depending on the type of daily activity of the user (morning, day, night, uncertain). An undefined type refers to users who commit activities at different intervals of the day.

To determine if the user is actively working with information from his or her newsfeed (denoted as probability \( p_{nf } \)), we use the probability of active processing of information by a given type of agent, time interval and day of week \( p_{nf} \left( {i, t,d} \right) \) which is estimated by observed traces of activity on his or her wall. To account for the fact that: (i) a user may not use a social network each consequent day, (ii) only small part of the processed messages are shared by the user, and (iii) newsfeed of an agent contains information from several communities, we add a modifier \( \gamma \). So, the resulting expression is given as:

$$ p_{nf } = p_{nf} \left( {i,t,d} \right) \cdot \gamma , $$
(1)

where \( i \) is a type of agent according to Fig. 3, \( t \) is a current model time, and \( d \) is a current day.

To determine the dynamics of information spread, a reaction model is used. The reaction to the post in the community can be neglecting (no observable trace), approval (like), sharing (repost) and discussing (comment). Depending on the frequency of activity, several types of users’ reaction are distinguished, which determines the likelihood of doing possible actions:

$$ p_{r} \left( j \right) = p_{r} \left( {j,u} \right) \cdot v_{IM} , $$
(2)

where \( p_{r} \left( {j,u} \right) \) determines the probability of reaction of \( j \)-th type for user \( u \), depending on the type of reaction (like, repost or comment), \( v_{IM} \) – the virality of the messages. This parameter is set based on the gamma distribution of the number of responses per message, normalized, so that the average value is 1.

4.3 Simulation Results

The experiments were carried out using the resources of the supercomputer Lomonosov [21]. The modeling cycle consists of sequential iterations; the iteration duration is 10 min of model time. The program was run on 8 processes (one master and 7 slaves). The computation time of one run for modeling the activity in the network for 3 months is about 3 h.

Figure 4a shows a comparison of the number of posts for several days of the week, the MAE (mean absolute error) is 0.099. A comparison of the probability of posts is shown in Fig. 4b. MAE is 0.007. The unevenness of publications throughout the day is related to the peculiarities of a work of the administrator of considered community.

Fig. 4.
figure 4

Comparison of publication time and result of generative model

The dynamics of responses for several messages from the moment of publication is presented in Fig. 5. In our model, each IM has individual dynamics and the number of reactions associated with its characteristics, for example, publication time that affects the response in the first hours after publication or the virality factor that influences the opinion of users.

Fig. 5.
figure 5

Example of IMs’ life during simulation, types of lines denote different IM

Figure 6 depicts the dynamics of the reaction to the messages in the first 24 h from the time of publication. In our model, there are 4 basic types of agent activity (Fig. 3) and their distribution. Therefore, the publication time affects the response to the message. So, the number of responses at night is less than in the daytime. For example, the number of reactions to a message published in the evening is less within 5 to 10 h from the time of publication. Curve for morning reports increases faster because all users have opportunity to read the message during the day.

Fig. 6.
figure 6

Curves of repost reaction in first 24 h; m, d, e denotes morning, day and evening

The saturation time and the shape of the curves make it possible to reproduce the dynamics of the number of reposts, but no less important characteristic in the reproduction of dynamics is the finite number of reactions to information messages. In our model, there were three possible reactions to the message: like, repost and comment. Figure 7 shows the distribution of these characteristics for the real and model community. Model results are quite similar to the reaction from the real social network. Tails in a distribution based on real data are associated with the emissions that are present in the response to messages.

Fig. 7.
figure 7

Comparison of reactions to posts in the community and in the model, where the vertical axis represents the distribution of the number of reactions: (a) likes; (b) reposts; (c) comments

5 Conclusion and Future Works

In this paper, we describe the first results of parallel modeling of information spread based on the composition of models. This composition is presented by three internal models representing different drivers of information process and defining the behavior of entities: model of IM’s generation, model of activity and model of reaction. Each of the models is independent and can be tailored to the simulation problem under consideration. Further, we developed the parallel version of modeling framework, which can handle millions of agents. This part of the work is necessary for sufficiently fast processing of all entities, because social networks contain millions of users with different connections and characteristics.

Parameters of models are trained using the data collected from a Russian social network vk.com, which allows us to validate the proposed model. We compare several aggregated characteristics provided by model with the real data: number of IMs’ publications depending on the time of day and day of the week, the dynamics of response to a single message and accumulated reactions of subscribers to a series of IMs. The “violin plot” graphs demonstrate that the model allows to reproduce the aggregated response to messages reliably. Thus, modeling the behavior of individual users on the network, we were able to reproduce the reaction of the whole community to IMs (like a bottom-up approach from micro to macro level).

Future work includes automatic training of parameters of the models using meta-algorithm and investigation of importance of different drivers of information spread within proposed framework. A separate line of research will be devoted to modeling the longitudinal evolution of user states in terms of their level of involvement in interaction with the community. In addition, the important task is to improve the scalability of the framework, which will reduce the computation time for a large number of model entities.

This research was supported by The Russian Scientific Foundation, Agreement #14–21–00137–П (02.05.2017).