1 Introduction

With the rapid development of the Internet and the mobile Internet, social networks have become an important platform for people’s online life. For example, WeChat, one of the most prevalent social network platforms in China, has more than 800 million monthly active users [1]. Social networks are not only effective tools for connecting individuals but also powerful platforms for delivering services, which has driven the transformation from traditional social networks to service-oriented social networks. In service-oriented social networks, discovering a small subset of influential individuals is particularly important. From the service providers’ perspective, it is cost-effective to target only these influential individuals when promoting services, because they are more conducive to propagating services by “word-of-mouth” [2]. From the users’ perspective, it is trustworthy to follow the recommendations of these influential individuals when seeking high-quality services. Formally, the problem of discovering influential individuals is referred to as influence maximization.

During the past few decades, many approaches [3, 4, 5, 7, 8, 9, 10, 11] have been proposed to solve the influence maximization problem. These approaches mainly fall into two categories: (1) greedy algorithms [3, 5, 8, 11], which offer strong performance guarantees but are time-consuming; and (2) heuristic algorithms [4, 7, 9, 10], which are time-efficient but lack performance guarantees. Therefore, it is essential to devise new algorithms that combine high efficiency with a strong performance guarantee.

In this paper, we propose a novel approach that improves Kempe’s greedy algorithm [8] while offering both efficiency and a performance guarantee. The basic idea is to discover influential individuals within communities rather than in the entire network, since community structure is a basic and important property of social networks [14] and has a prominent effect on the influence spreading process [6, 15]. Intuitively, a community is a set of nodes with dense internal connections and sparse external connections. Individuals within a community tend to communicate more and are thus more likely to influence each other, while individuals in different communities tend to have fewer contacts and are thus less likely to influence each other. Therefore, discovering influential individuals within communities is a good approximation to discovering them in the entire network. The proposed approach contains two phases: community detection in the first phase and influential individual discovery in the second phase.

To detect high-quality communities, we first exploit LINE [16], one of the most popular network embedding methods, to extract a d-dimensional vector representation for each node in the network, which preserves the network neighbourhood relationships well. The network is then partitioned into c communities by applying the classic k-means algorithm [13] to the obtained vector representations. After obtaining the communities, we propose two community-based approximation algorithms to discover influential individuals. We first propose the basic community-based approximation algorithm BCAA, which is c times faster than Kempe’s greedy algorithm, where c denotes the number of communities. BCAA is a simple variant of Kempe’s greedy algorithm; the only difference is that BCAA estimates the influence spread of a subset of nodes within each individual community instead of the whole network. To further speed up BCAA, we propose the improved community-based approximation algorithm ICAA, which avoids many wasteful computations by taking advantage of the submodularity of the influence spread (see Sect. 2 for details). We further analyze the performance guarantee of the proposed approach and show that both BCAA and ICAA obtain a \((1 - e^{- \frac{1}{1 + (c - 1) {\varDelta } I_c}})\) approximation to the optimal solution, where \(\varDelta I_c\) is the maximal influence spread of a node in the communities that do not contain this node.

In summary, our contributions are threefold: (1) a new community detection method based on network embedding is proposed to detect high-quality communities; (2) two novel algorithms with performance guarantees are proposed to discover influential individuals by exploiting the community structures of social networks; (3) extensive experiments are conducted to demonstrate the effectiveness and efficiency of the proposed algorithms.

2 Problem Statement

A social network can be modeled as a weighted graph \(G = (V, E, P)\) with \(n = |V|\) nodes and \(m = |E|\) edges. Each directed edge \(e = (u, v)\) between nodes u and v is associated with a weight \(p_{uv} \in [0, 1]\) in P, which represents the probability that node u influences node v.

Let \(S \subseteq V\) be the subset of nodes selected as the initial target nodes for influence spreading. We define the influence spread of S, denoted by I(S), as the expected number of nodes that are eventually influenced by S under a given spreading model. It is worth noting that I(S) is a submodular function, i.e., \(I(S \cup \{v\}) - I(S) \ge I(T \cup \{v\}) - I(T)\), for all \( v \in V\) and \(S \subseteq T \subseteq V\). In other words, the marginal gain of adding a node never increases as the seed set grows.

To estimate the influence spread I(S), the spreading model must first be fixed. Here, we adopt the independent cascade (IC) model [8]. In the IC model, each node is in one of two states, active or inactive, and the influence spreading process unfolds in discrete timestamps according to the following rules. When node u becomes active at timestamp t, it makes a single attempt to activate each inactive neighbour v at timestamp \(t + 1\), succeeding with probability \(p_{uv}\). Node u cannot make any further activation attempts at subsequent timestamps. The spreading process runs until no more activations are possible.
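The IC rules above can be illustrated with a minimal Monte Carlo sketch. It is not the implementation used in this paper: the names `simulate_ic` and `estimate_spread`, the adjacency-dictionary representation of \(G\), and the toy graph are assumptions made for illustration only.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one IC cascade and return the set of nodes that end up active.

    graph maps each node u to a dict {v: p_uv} of its out-neighbours
    (an assumed representation of the weighted graph G = (V, E, P)).
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes that became active at the current timestamp
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p_uv in graph.get(u, {}).items():
                # u gets exactly one attempt on each still-inactive neighbour
                if v not in active and rng.random() < p_uv:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active


def estimate_spread(graph, seeds, runs=100, seed=0):
    """Monte Carlo estimate of I(S): the mean cascade size over `runs` cascades."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs))
    return total / runs


# Toy graph with deterministic edges, so the estimate is exact here.
toy = {"a": {"b": 1.0, "c": 0.0}, "b": {"d": 1.0}, "c": {}, "d": {}}
print(estimate_spread(toy, {"a"}))  # a activates b, b activates d, c is never reached
```

Because every newly active node is placed in the frontier exactly once, each edge is tried at most once per cascade, which matches the single-attempt rule of the IC model.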

Definition 1

(Influence Maximization Problem). Given a weighted graph \(G = (V, E, P)\) and a parameter k, the influence maximization problem aims at discovering a size-k subset of nodes \(S \subseteq V\) such that I(S) is maximal.

3 Proposed Solutions

3.1 Network Embedding Based Community Detection

Network embedding aims at extracting low-dimensional, high-quality features for each node in a network. Definition 2 presents its formal definition.

Definition 2

(Network Embedding). Given a network \(G = (V, E)\), the goal of network embedding is to embed each node \(v \in V\) into a low-dimensional space \(R^d\), that is, to learn a mapping function \(f_G: V \rightarrow R^d\), where \(d \ll |V|\). In space \(R^d\), the network neighbourhood of each node is well preserved.

In this paper, we employ the network embedding model LINE [16], which aims to preserve both the first-order proximity and the second-order proximity of a network. The first-order proximity refers to the local pairwise proximity between directly linked nodes, while the second-order proximity refers to the similarity of two nodes’ neighbourhood structures. We choose LINE because it preserves community structures well: intuitively, two nodes that are directly linked or share many common neighbours are more likely to belong to the same community. After obtaining the low-dimensional vector representations of all nodes, we apply the classic k-means algorithm [13] to partition the network into c communities. This network embedding based community detection (NECD) procedure detects high-quality communities when c is set properly, and allows the number of communities to be controlled flexibly while maintaining reasonable quality.
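The clustering step of NECD can be sketched as follows. This is a simplified illustration under stated assumptions: the embeddings are hand-written 2-dimensional vectors standing in for LINE output (LINE itself is not implemented here), the farthest-point seeding is a choice made only to keep the example deterministic, and the names `kmeans` and `communities_from_labels` are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(vectors, c, iterations=50):
    """Partition {node: vector} into c clusters with Lloyd's algorithm."""
    nodes = sorted(vectors)
    # Farthest-point seeding (an assumption; any k-means seeding would do).
    centers = [vectors[nodes[0]]]
    while len(centers) < c:
        far = max(
            nodes,
            key=lambda v: min(squared_distance(vectors[v], m) for m in centers),
        )
        centers.append(vectors[far])
    labels = {}
    for _ in range(iterations):
        # Assignment step: each node joins its nearest center.
        new_labels = {
            v: min(range(c), key=lambda j: squared_distance(vectors[v], centers[j]))
            for v in nodes
        }
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(c):
            members = [vectors[v] for v in nodes if labels[v] == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def communities_from_labels(labels):
    """Group nodes by cluster label into a list of node sets."""
    groups = {}
    for v, j in labels.items():
        groups.setdefault(j, set()).add(v)
    return list(groups.values())


# Hypothetical 2-dimensional embeddings standing in for LINE output.
embeddings = {
    "a": (0.0, 0.1), "b": (0.1, 0.0), "c": (0.0, 0.0),
    "x": (5.0, 5.1), "y": (5.1, 5.0),
}
print(communities_from_labels(kmeans(embeddings, 2)))
```

In practice the vectors would be d-dimensional, and a library k-means implementation would normally replace this hand-written loop.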

3.2 Basic Community-Based Approximation Algorithm BCAA

Algorithm 1. BCAA

In this part, we devise BCAA to improve Kempe’s greedy algorithm [8] by taking advantage of network communities. BCAA is outlined in Algorithm 1. Building on the NECD procedure, BCAA first partitions network G into c communities. Then, on the basis of these c communities and under the IC model, BCAA discovers seed nodes one by one iteratively. In each iteration, BCAA selects the node with maximal marginal influence spread as the next seed node (Steps 4-8). However, BCAA computes each node’s marginal influence spread within each individual community instead of the entire network, i.e., \(M_C(v) \leftarrow I_C(H \cup \{v\}) - I_C(H)\), where \(I_C(\cdot )\) and \(M_C(\cdot )\) denote community-based influence spread and community-based marginal influence spread respectively, and H denotes the seed nodes contained in the community that contains v (Steps 5-7).
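The selection loop described above can be sketched in Python as follows. The sketch is reconstructed from the prose description only, so it should be read as an approximation of Algorithm 1 rather than a transcription of it. The function names, the adjacency-dictionary graph format, and the assumption that the communities are supplied as a list of node sets (for example, by the NECD procedure) are all illustrative choices.

```python
import random


def community_spread(graph, community, seeds, runs, rng):
    """Monte Carlo estimate of I_C(seeds): IC cascades confined to `community`."""
    if not seeds:
        return 0.0
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, p_uv in graph.get(u, {}).items():
                    # Edges leaving the community are ignored.
                    if v in community and v not in active and rng.random() < p_uv:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs


def bcaa(graph, communities, k, runs=100, seed=0):
    """Greedy seed selection using community-based marginal influence spread."""
    rng = random.Random(seed)
    home = {v: i for i, com in enumerate(communities) for v in com}
    seeds_in = [set() for _ in communities]  # H: seeds already chosen per community
    seeds = []
    for _ in range(k):
        # I_C(H) for every community, computed once per iteration.
        base = [
            community_spread(graph, com, h, runs, rng)
            for com, h in zip(communities, seeds_in)
        ]
        best, best_gain = None, float("-inf")
        for v, i in home.items():
            if v in seeds_in[i]:
                continue
            # M_C(v) = I_C(H ∪ {v}) - I_C(H)
            gain = community_spread(
                graph, communities[i], seeds_in[i] | {v}, runs, rng
            ) - base[i]
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break  # fewer than k nodes are available
        seeds.append(best)
        seeds_in[home[best]].add(best)
    return seeds


# Toy example: the cross-community edge c -> x is ignored by community_spread.
graph = {"a": {"b": 1.0, "c": 1.0}, "c": {"x": 1.0}, "x": {"y": 1.0}}
communities = [{"a", "b", "c"}, {"x", "y"}]
print(bcaa(graph, communities, 2))
```

Every cascade in this sketch touches only one community, which is the source of the speed-up over estimating spread on the whole network.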

3.3 Improved Community-Based Approximation Algorithm ICAA

In this part, we devise ICAA to improve BCAA by taking advantage of the submodularity of \(I(\cdot )\). The key idea of ICAA is that there is no need to recompute the community-based marginal influence spread of every node in \(V \setminus S_{i - 1}\) in each iteration i. This is because a previously computed community-based marginal influence spread of node v is an upper bound on v’s current community-based marginal influence spread. Moreover, the seed nodes contained in one community cannot affect the community-based marginal influence spread of nodes contained in any other community. Thus, when looking for a new seed node, we first choose the node with the maximal stored community-based marginal influence spread as a candidate, and then check whether its marginal influence spread needs to be recomputed. If not, this node is chosen as the next seed node; otherwise, we recompute its community-based marginal influence spread. ICAA is outlined in Algorithm 2.

Algorithm 2. ICAA

ICAA initially partitions network G into c communities via the NECD procedure. Then, ICAA calculates the community-based influence spread for each node \(v \in V\), and pushes a corresponding 3-tuple \((v, 0, M_C(\{v\}))\) into a priority queue Q (Steps 3-5). Here, the second element f of the 3-tuple records the number of seed nodes contained in the community of v at the time the third element was computed. Obviously, f is 0 for every node before the first seed node is determined. Each 3-tuple is prioritized by its third element: the larger the third element, the higher the priority. Hence, the node u corresponding to the first 3-tuple in Q has the largest community-based marginal influence spread, and ICAA takes u as the first seed node (Steps 6-7). Since u has been selected as a seed node, the community-based marginal influence spread of each node contained in the community that contains u (denoted as \(C_u\)) is now out of date. By the submodularity of \(I(\cdot )\), the community-based marginal influence spread of each node is non-increasing as more seed nodes are determined. That is, the third element of the 3-tuple corresponding to each node is an upper bound on its current community-based marginal influence spread. Building on this observation, the update of the community-based marginal influence spread of each node contained in \(C_u\) can be delayed, which avoids many wasteful computations. Thus, in the while loop, ICAA treats the node u corresponding to the first 3-tuple in Q as a candidate seed node rather than accepting it immediately as a new seed node (Step 9). Let f be the second element of u’s 3-tuple, and assume that the current number of seed nodes contained in \(C_u\) is \(n_u\). If \(f < n_u\), the community-based marginal influence spread of u is recomputed, f is updated to \(n_u\), and the updated 3-tuple is pushed into Q again (Steps 14-16). If \(f = n_u\), the stored value is up to date, and node u is selected as the next seed node directly (Steps 18-19). Following this strategy, ICAA discovers the k most influential nodes iteratively.
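The lazy-update strategy described above can be sketched with a binary heap playing the role of the priority queue Q. This is a sketch based on the prose description, not a transcription of Algorithm 2. The names, the graph format, and the tuple layout `(-gain, node, f)` are assumptions; the gain is negated because Python's `heapq` is a min-heap. Unlike the description, the first seed is not special-cased: since every f starts at 0, the uniform loop selects it in the same way.

```python
import heapq
import random


def community_spread(graph, community, seeds, runs, rng):
    """Monte Carlo estimate of I_C(seeds): IC cascades confined to `community`."""
    if not seeds:
        return 0.0
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            next_frontier = []
            for u in frontier:
                for v, p_uv in graph.get(u, {}).items():
                    if v in community and v not in active and rng.random() < p_uv:
                        active.add(v)
                        next_frontier.append(v)
            frontier = next_frontier
        total += len(active)
    return total / runs


def icaa(graph, communities, k, runs=100, seed=0):
    """Lazy greedy selection: recompute a gain only when it may be stale."""
    rng = random.Random(seed)
    home = {v: i for i, com in enumerate(communities) for v in com}
    seeds_in = [set() for _ in communities]
    # Heap entries are (-gain, node, f), where f is the number of seeds in the
    # node's community at the time the gain was computed.
    heap = [
        (-community_spread(graph, communities[i], {v}, runs, rng), v, 0)
        for v, i in home.items()
    ]
    heapq.heapify(heap)
    seeds = []
    while heap and len(seeds) < k:
        _, u, f = heapq.heappop(heap)
        i = home[u]
        n_u = len(seeds_in[i])
        if f == n_u:
            # The stored gain is current, and it is the largest upper bound.
            seeds.append(u)
            seeds_in[i].add(u)
        else:
            # The stored gain is stale: recompute it and push it back with f = n_u.
            h = seeds_in[i]
            gain = community_spread(
                graph, communities[i], h | {u}, runs, rng
            ) - community_spread(graph, communities[i], h, runs, rng)
            heapq.heappush(heap, (-gain, u, n_u))
    return seeds


graph = {"a": {"b": 1.0, "c": 1.0}, "c": {"x": 1.0}, "x": {"y": 1.0}}
communities = [{"a", "b", "c"}, {"x", "y"}]
print(icaa(graph, communities, 2))
```

In this toy run, the second seed is taken from the other community without any recomputation, because selecting a seed in one community leaves the stored gains of all other communities valid.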

Let \(\varDelta I_c\) denote the maximal influence spread of a node in the communities that do not contain this node. Now, we analyze the performance guarantee of BCAA and ICAA in Theorem 1.

Theorem 1

Both BCAA and ICAA obtain a \((1 - e^{- \frac{1}{1 + (c - 1) {\varDelta } I_c}})\) approximation to the optimal solution.
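To get a feel for how the bound in Theorem 1 behaves, the expression can simply be evaluated for a few settings. The helper name `approximation_bound` and the sample values of c and \(\varDelta I_c\) below are illustrative assumptions, not values reported in this paper.

```python
import math


def approximation_bound(c, delta_ic):
    """Evaluate 1 - exp(-1 / (1 + (c - 1) * delta_ic)) from Theorem 1."""
    return 1.0 - math.exp(-1.0 / (1.0 + (c - 1) * delta_ic))


# Illustrative (c, delta_ic) pairs only; they are not measured on any dataset.
for c, delta_ic in [(1, 0.0), (10, 0.0), (10, 0.1), (20, 0.5)]:
    print(c, delta_ic, round(approximation_bound(c, delta_ic), 3))
```

When c = 1 or \(\varDelta I_c = 0\), the expression reduces to \(1 - 1/e \approx 0.632\), the classical guarantee of the greedy algorithm; the bound weakens as the number of communities or the cross-community influence grows.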

4 Experiments

4.1 Experimental Settings

In the experiments, we evaluate our proposed approaches on three real-life social networks: WeChat [1] (1 K nodes, 7 K edges and 10 communities), Facebook [12] (4 K nodes, 88 K edges and 10 communities), and Epinions [12] (76 K nodes, 406 K edges and 20 communities). Since the original networks are unweighted, we use the number of common neighbours between two individuals u and v to denote the weight of edge \(e = (u, v)\), i.e., \(w_{uv} = |nb(u) \cap nb(v)|\), which is used in the NECD procedure. Here we use nb(u) to denote the union of u and its neighbours. The propagation probability of edge \(e = (u, v)\) is defined as follows.

$$\begin{aligned} p_{uv} = 2 \frac{|nb(u)| - 1}{|nb(v)| - 1} \cdot \frac{|nb(u) \cap nb(v)|}{|nb(u) \cup nb(v)|} \bar{p} \end{aligned} \tag{1}$$

where \(\bar{p}\) is the average propagation probability of the whole network. In our experiments, \(\bar{p}\) is set to be 0.05.
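The edge weight \(w_{uv}\) and the propagation probability of Eq. (1) can be computed directly from neighbour sets, as in the sketch below. The function names and the toy edge list are hypothetical, and treating the input as an undirected edge list is an assumption. The value is returned exactly as given by Eq. (1), without any clipping.

```python
def closed_neighbourhoods(edges):
    """Build nb(u), the set containing u and its neighbours, from an edge list."""
    nb = {}
    for u, v in edges:
        nb.setdefault(u, {u}).add(v)
        nb.setdefault(v, {v}).add(u)
    return nb


def edge_weight(nb, u, v):
    """w_uv = |nb(u) ∩ nb(v)|, the weight used in the NECD procedure."""
    return len(nb[u] & nb[v])


def propagation_probability(nb, u, v, p_bar=0.05):
    """p_uv as defined in Eq. (1); p_bar is the average propagation probability."""
    degree_ratio = (len(nb[u]) - 1) / (len(nb[v]) - 1)
    overlap = len(nb[u] & nb[v]) / len(nb[u] | nb[v])
    return 2 * degree_ratio * overlap * p_bar


# Toy network: a triangle a-b-c with a pendant node d attached to c.
nb = closed_neighbourhoods([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(edge_weight(nb, "a", "b"))
print(propagation_probability(nb, "c", "d"), propagation_probability(nb, "d", "c"))
```

The degree ratio makes the probability asymmetric: in the toy network, the higher-degree node c influences d with a larger probability than d influences c.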

We employ running time and approximation ratio as evaluation metrics. Running time measures the time efficiency of the proposed algorithms. The approximation ratio, defined as \(I(S) / I(S^*)\), measures how close the influence spread of the returned seed set S is to that of the optimal solution \(S^*\).

To evaluate the performance of our proposed algorithms, we select four representative approaches for comparison: two greedy algorithms, GA [8] and CELF++ [5], and two heuristic algorithms, IMRank [4] and Random [3].

Fig. 1. Running time testing via varying k

Fig. 2. Approximation ratio testing via varying k

4.2 Experimental Results

In the experiments, we fix the dimension d used in the NECD procedure at 60 and set the number of Monte Carlo simulations t under the IC model to 100.

Exp-1: Running time testing via varying k. In this experiment, we vary the size of the seed node set k from 1 to 30 to evaluate the efficiency of different algorithms. Figure 1 depicts the results. Note that the y-axis in this figure uses a logarithmic scale. From Fig. 1, we can see that the heuristic algorithms IMRank and Random run very fast, while the greedy algorithms GA and CELF++ run much slower. Both of our proposed algorithms, BCAA and ICAA, are several orders of magnitude faster than GA, and ICAA also runs much faster than CELF++. From Fig. 1, we can also see that the running time of ICAA barely changes as k increases. This is because the main time cost of ICAA is computing the community-based marginal influence spread of every node in the first iteration, and it takes little time to find the other \((k - 1)\) influential individuals in the subsequent iterations.

Exp-2: Approximation ratio testing via varying k. The objective of this experiment is to evaluate the degree of approximation of different algorithms by taking the results of GA as the ground truth. As shown in Fig. 2, CELF++ has the highest approximation ratio, while Random has the lowest. The approximation ratio of IMRank is unstable and mainly falls in the range [0.5, 0.8]. In contrast, BCAA and ICAA have much more stable and much higher approximation ratios. In particular, as k grows larger, the approximation ratios of BCAA and ICAA approach 1. This result is consistent with our performance guarantee analysis.

5 Conclusion

In this paper, we study the influence maximization problem on service-oriented social networks by taking community structures into account. First, we apply the classic k-means algorithm to network embeddings to detect communities. Next, we propose the basic community-based approximation algorithm BCAA, which discovers influential individuals within communities instead of the entire network, and then propose the improved community-based approximation algorithm ICAA to further speed up BCAA. We also provide a performance guarantee analysis of the proposed algorithms. Finally, we validate the proposed algorithms through experiments.