1 Introduction

Online Social Networks (OSN) are websites enabling users to build connections and relationships with each other. The OSN structure represents the social interactions and relationships between its entities, the users. Social networks are widely used by members to share information with the purpose of reaching as many friends as possible. The spread of shared information is influenced by human decisions, and users are not fully aware of the possible consequences of their preferences when specifying access rules to their shared data. It is the responsibility of OSN administrators to effectively control the shared information, reduce the risk of information leakage, and constantly evaluate the potential risks of shared-information leakage. Most access rules are defined in terms of the degree of relationship required to access one's data. These rules are not refined enough to allow for dynamic denial of content from certain peers of the community.

We propose a model for access control that works with minimal user intervention. The model is based on users' patterns of sharing information, denoted as sharing-habits. Naturally, some users are more likely to share information with others. To minimize the probability of information leakage, the social network is analyzed to determine, based on these habits, the probability of information flow through network connections. In a graph representation of the network, where edges indicate relationships between users, the challenge is to select the set of edges that should be blocked to prevent leakage of the shared information to unwanted recipients. We review some methods for handling and preserving privacy in social networks, and present our new privacy-preserving approach based on sharing-habits data. Our model combines algorithms that use graph-flow methods such as max-flow-min-cut and contract. Experimental results show the effectiveness of these algorithms in controlling the flow of information so as to allow sharing with friends while hiding from others. The paper is structured as follows: in the next section we review related work, in Sect. 3 we define the privacy assurance in OSN problem, and in Sect. 4 we present our method for dealing with this problem. We explain our evaluation method and preliminary results in Sect. 5, and conclude by summarizing our contribution and discussing directions for future work in Sect. 6.

2 Related Work

There are various types of Online Social Networks, each with different properties, and privacy preservation can be viewed and handled from various aspects. Carmagnola et al. [5] present research on the factors that enable user identification and information leakage in social networks, based on entity resolution. They conducted a study on the possible factors that make users vulnerable to identification and to personal-information leakage, and on the perception of users about privacy related to the spreading of their public data. To find the risk factors, they studied the relations between user behavior (habits) on OSNs and the probability of users' identification. Kleinberg and Ligett [7] describe the social network as a graph where nodes represent users, and an edge between two nodes indicates that those two users are enemies who do not wish to share information. The problem of information sharing is cast as the graph coloring problem; Kleinberg and Ligett [7] analyze the stability of solutions for this problem, and the incentive of users to change the set of partners with whom they are willing to share information. Tassa and Cohen [11] handle the information release problem, and present algorithms that compute an anonymization of the released data to a level of k-anonymity; the algorithms can be used in sequential and distributed environments, while maintaining high utility of the anonymized data. Vatsalan et al. [3] conducted a survey of privacy-preserving record linkage (PPRL) techniques, with an overview of techniques that allow the linking of databases between organizations while preserving the privacy of these data. They present a taxonomy that characterizes the known PPRL techniques along 15 dimensions, highlight shortcomings of current techniques, and point out avenues for future research. Park and Sandhu [6] present the ORIGIN CONTROL access control model, where every piece of information is associated with its creator forever. Ranjbar and Maheswaran [1] describe the social network as a graph where nodes represent users, and an edge between two nodes indicates that those two users are friends who wish to share information. They present algorithms for defining communities among users, where the information is shared among users within the community, and algorithms for defining a set of users that should be blocked in order to prevent the shared information from reaching the adversaries and leaking outside the community. In OSN, communities are subsets of users connected to each other; the community members have common interests and high levels of mutual trust. A community can be described by a connected graph, where each user is a node, and an edge connecting two nodes indicates a relationship between two users. A community is defined by Ranjbar and Maheswaran [1] from the viewpoint of an individual user: myCommunity is the largest sub-graph of users who are likely to receive and hold the information without leaking it. In other words, myCommunity is the subset of an individual user's friends that have intense and frequent interactions; it describes a grouping abstraction of a set of users that surrounds an individual, based on the communication patterns used for information sharing.
Our study is based on the ideas described in their paper; however, while they share information only within the defined community and block users that might leak information to adversaries, we relax this limitation and block only edges on the paths to the adversaries, instead of blocking all the information from the source user to the users that might leak it.

3 The Privacy Assurance in OSN Problem

In this section we define the general problem of privacy assurance in OSN and our proposed method that uses information from users' sharing-habits.

Let \(G = (V,E)\) be a directed graph that describes a social network, where V is the set of the network's users, and E is the set of directed and weighted edges representing the users' information-flow relationships. An edge \((u_i,u_j)\in E\) exists only if \(u_i\) shares information with \(u_j\). The distance between two vertices, \(dist_{G}(u_i,u_j)\), is the length of the shortest path from \(u_i\) to \(u_j\) in G. An ego is an individual focal node: the specific user from which we consider the information flow. A network has as many egos as it has nodes; an ego-community is the collection of an ego and all nodes to whom the ego has a connection at some path length. The \(\delta \)-community of a user, represented by the ego vertex \(u_i\), is the sub-graph \(G_{\delta }(u_i)=(V_{\delta }(u_i),E_{\delta }(u_i))\), where for each \(v_i \in V_{\delta }(u_i)\), \(v_i \ne u_i\), \(dist_{G}(u_i,v_i) \le \delta \).

The following definitions are as given by Ranjbar and Maheswaran [1]: \(p_i\) is the probability that user \(u_i\) is willing to share information with some of his friends.

$$\begin{aligned} p_i= \begin{cases} outflow/inflow &amp; \text {if } outflow < inflow, \\ 1 &amp; \text {if } outflow \ge inflow. \end{cases} \end{aligned}$$
(1)
  • Outflow is the number of sharing interactions from \(u_i\) to his friends.

  • Inflow is the number of sharing interactions from \(u_i's\) friends to \(u_i\).
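As a small illustration (ours, not the paper's), Eq. (1) can be read as the fraction of incoming shares that \(u_i\) passes on to his friends, capped at 1. A Python sketch (Python is used for all examples in this paper):

```python
def sharing_probability(outflow: int, inflow: int) -> float:
    """p_i per Eq. (1): the fraction of incoming shares that u_i passes on,
    capped at 1 (inflow == 0 falls into the outflow >= inflow case)."""
    return 1.0 if outflow >= inflow else outflow / inflow
```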

The likelihood of \(u_i\) sharing information with \(u_j\) along the edge \((u_i,u_j)\) is represented by \(w_{i,j}\), the weight on the edge \((u_i,u_j)\). This weight is derived from the relationship between \(u_i\) and \(u_j\); it is a fixed number indicating the willingness of \(u_i\) to share information with \(u_j\), it changes rarely if at all, and it may be set by the user. The probability of flow between two neighboring users \(u_i\) and \(u_j\) is denoted \(p_{i,j}\), and is calculated by \(p_{i,j}=p_i \times w_{i,j}\). Since the flow may change quite often, this probability may change with it. We assume that user behavior is consistent: user \(u_i\) shares all the data with user \(u_j\) with probability \(p_{i,j}\). This probability can change with time, but it does not depend on the content of the shared information. The Probability of Information Flow (PIF) is the maximum probability of information flow over all paths between \(u_i\) and \(u_j\). The probability flow of a path between \(u_i\) and \(u_j\), denoted \(PATH_{i,j}\), is the flow of the edge with the minimum \(p_{i,j}\) on that path. The PIF is the maximum of \(PATH_{i,j}\) over all paths between \(u_i\) and \(u_j\). The function f, which denotes flow, is computed using the logarithms of the edges' probabilities on a path between \(u_i\) and \(u_j\). To prevent information flow from one user to another we search for the minimal set of edges that, when removed from the community graph, or blocked, disables the flow. We denote this set of blocked edges as B. Note that after edges are removed, the PIF, and therefore f, should be recomputed.
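To make the PIF concrete, the following is a minimal sketch, ours rather than the paper's, of the bottleneck reading of the definition above: the maximum, over all paths from the source, of the minimum edge probability along the path. It assumes the networkx library, with each \(p_{i,j}\) stored in an edge attribute 'p'. Taking logarithms, as the definition of f suggests, turns products of probabilities into sums, so the alternative product reading reduces to a standard shortest-path computation on \(-\log p\) weights.

```python
import heapq
import networkx as nx

def pif(G: nx.DiGraph, src):
    """PIF from src to every reachable node: the maximum over all paths of
    the minimum edge probability 'p' on the path (a widest-path variant of
    Dijkstra's algorithm)."""
    best = {src: 1.0}              # best bottleneck probability found so far
    heap = [(-1.0, src)]           # max-heap via negated bottlenecks
    while heap:
        neg_b, u = heapq.heappop(heap)
        b = -neg_b
        if b < best.get(u, 0.0):
            continue               # stale heap entry
        for _, v, data in G.out_edges(u, data=True):
            nb = min(b, data['p'])           # bottleneck through (u, v)
            if nb > best.get(v, 0.0):
                best[v] = nb
                heapq.heappush(heap, (-nb, v))
    best.pop(src)
    return best                    # node -> PIF(src, node)
```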

3.1 Problem Goal

Our aim is to enable a user \(u_i\) to share information with as many friends and acquaintances as possible, while preventing information leakage to adversaries within the user's community. Ranjbar and Maheswaran [1] describe a method for sharing information within the community defined for the source user \(u_i\), while blocking users (friends and acquaintances) that might leak information to adversaries. We relax the limitation caused by blocking friends: instead of blocking all the information from the source user \(u_i\) to the users that might leak it, we block only edges on the paths from \(u_i\) to his adversaries. We use the following criteria to define and evaluate the resulting \(u_i\) ego-community graph:

  1. Minimum Friends Information Flow: the information flow from \(u_i\) to every user within his community must preserve at least a fraction \(\alpha \) of the original information flow to that user.

    Let \(G_{\delta }(u_i) = (V_{\delta }(u_i),E_{\delta }(u_i))\) be the \(\delta \)-community of \(u_i\), and let \(v \in V_{\delta }(u_i)\):

    $$\begin{aligned} f(u_i,v) \ge \alpha \cdot f_{original}(u_i,v) \end{aligned}$$
    (2)
  2. Close Friends Distance: close friends are defined by their distance from \(u_i\). \(G_{\beta }(u_i) = (V_{\beta }(u_i),E_{\beta }(u_i))\) is the \(\beta \)-community of \(u_i\), \(v \in V_{\beta }(u_i)\), \(\beta < \delta \). This criterion reflects the requirement that all users within \(u_i's\) \(\beta \)-community must receive the entire information from \(u_i\), and cannot be blocked.

    Let B be the set of blocked edges; then

    $$\begin{aligned} B \subset \{ (u_s,u_t) \mid dist_{G_{\delta }}(u_i,u_s) \ge \beta ,\; u_s,u_t,u_i \in V_{\delta }(u_i)\} \end{aligned}$$
    (3)

    We assume that there are no adversaries within \(u_i's\) \(\beta \)-community, otherwise the above condition can never be fulfilled.

  3. Maximum Adversaries Information Flow: the information flow from \(u_i\) to each of his adversaries must not exceed a fraction \(\gamma \) of the original information flow to that adversary.

    $$\begin{aligned} f(u_i,u_{adv}) \le \gamma \cdot f_{original}(u_i,u_{adv}) \end{aligned}$$
    (4)

For example, the threshold parameters can be \(\alpha =0.9\), \(\beta =2\), and \(\gamma =0.1\). The problem goal is to remove the fewest edges such that Eqs. 2, 3 and 4 are all satisfied.

Fig. 1. \(u_i's\) community graph (Color figure online)

Figure 1 describes a \(\delta \)-community graph for \(u_i\). The dotted area surrounds \(u_i's\) \(\delta \)-community graph with \(\delta =4\), i.e. all acquaintances within distance \(\le 4\). The blue area surrounds \(u_i's\) \(\beta \)-community, i.e. all friends within distance \(\le 2\).

As shown in the figure, the \(\delta \)-community of friends is much larger than the \(\beta \)-community of close friends.

3.2 Cuts in Graphs

A cut in a graph is a set of edges between two subsets of the graph's vertices, one containing \(u_i\) and the other containing \(u_i's\) adversaries, such that removing the cut's edges prevents information flow from one subset to the other.

A naive algorithm for solving the problem would be one that finds any cut between the adversaries' set and \(u_i's\) community, and defines this cut as the blocked-edges list. Algorithm 1 is such a naive algorithm.

The naive algorithm is not suitable for our problem, since it complies with neither the (1) Minimum Friends Information Flow nor the (2) Close Friends Distance criteria. Criterion (1) requires a minimum information flow from \(u_i\) to all members of \(u_i's\) community, which the naive algorithm does not handle. Criterion (2) defines close friends by their distance from \(u_i\), which the naive algorithm does not handle either. While the naive algorithm is not sufficient for our problem, it is important for understanding the theoretical problem defined here.

Algorithm 1. Naive blocked-edges algorithm
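As an illustration only (not the paper's Algorithm 1), the most trivial instantiation of this idea is to block every edge that enters an adversary; the sketch below assumes a networkx DiGraph. It always yields a valid cut, but, as noted above, it ignores criteria (1) and (2):

```python
import networkx as nx

def naive_blocked_edges(G: nx.DiGraph, adversaries) -> set:
    """Trivial cut: block every edge entering an adversary. This always
    disconnects the adversaries but ignores the privacy criteria."""
    return {(u, v) for v in adversaries for u, _ in G.in_edges(v)}
```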

4 The Sharing-Habits Based Privacy Assurance in OSN Solution

In our solution we propose a model for finding the set of edges that should be blocked in order to achieve maximum information sharing within the community of the information source with minimum information leakage. Our model uses two methods for defining candidate sets of blocked edges, along with an evaluation method for choosing the best set to be blocked. Our method consists of two major steps: the first is the initialization step, which creates a multi-graph with a super-vertex \(s_1\) containing \(u_i's\) \(\beta \)-community, described in Subsect. 4.1; the second step, described in Subsect. 4.2, uses two methods to find candidates-sets for blocked edges.

Algorithm 2 wraps these steps to construct the set of edges to be blocked.

Algorithm 2. Construct the blocked-edges set
Fig. 2. Construct blocked edges: main building blocks

Figure 2 describes the main building blocks of the algorithm for defining the edges to be removed from \(u_i's\) \(\delta \)-community in order to prevent information leakage to \(u_i's\) adversaries.

Next we detail each one of these building blocks.

4.1 Initialization

The \(\delta \)-community of \(u_i\) consists of all users \(u_j\) connected to \(u_i\) by a path of length \(\le \delta \). The \(\beta \) parameter defines the size of the community of close friends. Therefore, a \(\beta \)-community of \(u_i\) is a sub-graph contained in the \(\delta \)-community, where \(\beta \le \delta \), as demonstrated in Fig. 1. The privacy criteria defined in Subsect. 3.1 require that the entire information shared by \(u_i\) be shared with \(u_i's\) close friends (criterion 2). In order to comply with this criterion, the Initialization step creates a multi-graph with one super-vertex \(s_1\) containing \(u_i\) and his close friends at distance \(\le \beta \). This step ensures that the algorithm won't select edges for blocking on paths between \(u_i\) and his close friends, since \(u_i\) and his close friends are in the same super-vertex \(s_1\); see Fig. 3.
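A minimal sketch of this step, under the same assumptions as before (networkx; the helper name initialize is ours): build \(u_i's\) \(\delta \)-community as an ego graph, then contract every vertex of the \(\beta \)-community into a single super-vertex \(s_1\).

```python
import networkx as nx

def initialize(G: nx.DiGraph, u_i, delta: int, beta: int):
    """Initialization step: return u_i's delta-community as a multi-graph in
    which the whole beta-community is contracted into one super-vertex s1."""
    G_delta = nx.ego_graph(G, u_i, radius=delta)    # delta-community of u_i
    close = set(nx.ego_graph(G, u_i, radius=beta))  # beta-community vertices
    M = nx.MultiDiGraph(G_delta)                    # keep parallel edges
    s1 = u_i                                        # the super-vertex label
    for v in close - {u_i}:
        # merge each close friend into s1; no self-loops inside s1
        M = nx.contracted_nodes(M, s1, v, self_loops=False)
    return M, s1
```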

Figure 3(a) describes a \(\delta \)-community graph for \(u_0\) with 10 members and \(\delta =3\): four close friends at distance 1 (blue vertices), four acquaintances (green vertices), and two adversaries (red vertices). Figure 3(b) describes the graph after initialization.

4.2 Construct Blocked Edges Candidates

We use two methods, derived from flow problems, to find the initial candidates-set of edges to be blocked. This candidates-set is a cut between two sets of vertices: one set contains \(u_i\), \(u_i's\) \(\beta \)-community, and some vertices from \(u_i's\) \(\delta \)-community; the other set contains the remaining part of \(u_i's\) \(\delta \)-community and \(u_i's\) adversaries.

This candidates-set is then evaluated in order to derive the final candidates-set, by selecting a set that complies with the required privacy criteria. This process is described in Subsect. 4.3; the two methods we use for finding the initial candidates-sets of edges to be blocked are:

  1. Min-Cut: based on the Ford-Fulkerson max-flow-min-cut algorithm [4], to find the minimum cut between the super-vertex \(s_1\) that contains \(u_i\) and his close friends, and each of \(u_i's\) adversaries. This process is described in Subsect. 4.2.1.

  2. Contract: based on the contract algorithm by Karger et al. [9], to find any cut between the super-vertex \(s_1\) that contains \(u_i\) and his close friends, and each of \(u_i's\) adversaries. This process is described in Subsect. 4.2.2.

4.2.1 Block Edges by Min-Cut

Algorithm 3 implements the Sharing-habits privacy assurance based on the max-flow min-cut method by Ford and Fulkerson [4], and then checks for privacy criteria compliance:

  1. Find a minimum cut between super-vertex \(s_1\) and \(u_i's\) adversaries [4].

  2. Check if the cut complies with the required privacy criteria as defined in Subsect. 3.1, and select the final candidates-set. This process is described in Subsect. 4.3.
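A sketch of step 1, ours rather than the paper's Algorithm 3, on the multi-graph produced by the initialize() helper above. We use the edge probability itself as the capacity, which is one plausible choice since the paper leaves the capacity function implicit; parallel edges left by the contraction are collapsed, keeping the larger probability.

```python
import networkx as nx

def block_edges_by_min_cut(M: nx.MultiDiGraph, s1, adversaries) -> set:
    """Initial candidates-set via max-flow-min-cut [4]: the union, over all
    adversaries, of the edges crossing a minimum s1->adversary cut."""
    H = nx.DiGraph()
    for u, v, data in M.edges(data=True):          # collapse parallel edges
        p = data['p']
        if not H.has_edge(u, v) or H[u][v]['capacity'] < p:
            H.add_edge(u, v, capacity=p)
    candidates = set()
    for adv in adversaries:
        _, (S, T) = nx.minimum_cut(H, s1, adv)     # Ford-Fulkerson family
        candidates |= {(u, v) for u, v in H.edges() if u in S and v in T}
    return candidates
```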

Algorithm 3. BlockEdgesByMinCut
Algorithm 4. Block edges by Contract

Algorithm 5 is called by Algorithm 4 to find a cut between two vertices by randomly selecting an edge and contracting the two vertices connected by the selected edge into one super-vertex.

Algorithm 5. Contract
Fig. 3. \(u_0's\) \(\delta \)-community graph: (a) \(u_0's\) community (b) after initialization (Color figure online)

Fig. 4. Contract: (a) Edge (5, 10) was randomly selected, (b) Edge (5, 2) cannot be selected, since the algorithm can't contract a super-vertex containing \(u_0\) with a super-vertex containing \(u_0's\) adversary. (Color figure online)

Fig. 5. Contract: (a) Edge (3, 7) is randomly selected (b) The obtained cut from one run of the Contract algorithm (Color figure online)

4.2.2 Block Edges by Contract

The minimum cut between \(G_{\beta }(u_i)\) and \(u_i's\) adversaries, found by the BlockEdgesByMinCut algorithm, might not be the optimal solution for our problem, since the edges in this cut may not satisfy the privacy criteria. Thus, we also use the contract algorithm, which finds a variety of other cuts that possibly comply with the required privacy criteria.

Algorithm 4 implements the Sharing-habits privacy assurance based on the contract method by Karger and Stein [8, 9].

In each iteration, the contract algorithm finds a different cut between the super-vertex containing \(G_{\beta }(u_i)\) and the super-vertex containing \(u_i's\) adversaries. The contract algorithm repeatedly contracts vertices into super-vertices until it obtains two super-vertices connected by a set of edges, which defines a cut between the two sets of vertices contained in the super-vertices.

Algorithm 4 is composed of the following main steps:

  1. Find a cut between the super-vertex containing \(G_{\beta }(u_i)\) and \(u_i's\) adversaries; this step uses the contract algorithm presented in [8, 9].

  2. Check if the cut complies with the required privacy criteria as defined in Subsect. 3.1, and select the final candidates-set. This process is described in Subsect. 4.3.

Figures 4 and 5 describe a simple community graph and some steps of one run of the contract algorithm.
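A sketch of the randomized contraction at the core of Algorithm 5, again ours and under stated assumptions (nodes and edges are plain lists, e.g. list(M.nodes()) and list(M.edges())). It uses a union-find table instead of physically merging vertices, and respects the rule illustrated in Fig. 4 that \(s_1's\) group may never be merged with a group containing an adversary.

```python
import random

def contract_once(nodes, edges, s1, adversaries):
    """One run of randomized contraction [8, 9]: repeatedly pick a random
    edge and merge the groups of its endpoints -- never merging s1's group
    with a group containing an adversary (cf. Fig. 4) -- until only two
    groups remain. Returns the edges crossing the resulting cut."""
    parent = {v: v for v in nodes}

    def find(v):                                # union-find representative
        while parent[v] != v:
            parent[v] = parent[parent[v]]       # path halving
            v = parent[v]
        return v

    adv_roots = {find(a) for a in adversaries}
    groups = len(nodes)
    while groups > 2:
        eligible = [(u, v) for u, v in edges
                    if find(u) != find(v)
                    and not (find(s1) in (find(u), find(v))
                             and (find(u) in adv_roots
                                  or find(v) in adv_roots))]
        if not eligible:
            break                               # only forbidden merges left
        u, v = random.choice(eligible)
        ru, rv = find(u), find(v)
        parent[ru] = rv                         # contract the chosen edge
        if ru in adv_roots:
            adv_roots.discard(ru)
            adv_roots.add(rv)
        groups -= 1
    return [(u, v) for u, v in edges if find(u) != find(v)]
```

Calling contract_once repeatedly and keeping the best criteria-compliant cut mirrors the repeated-runs behavior discussed in Sect. 5.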

Algorithm 6. Compute the final candidates-set

4.3 Compute Final Candidates Set

After selecting the initial candidates-set of edges to be blocked, each method uses Algorithm 6 to select the final candidates-set of edges that should be removed from \(u_i's\) \(\delta \)-community graph. In the first step of the algorithm, we check whether, after removing the initial candidates-set of edges from \(u_i's\) \(\delta \)-community graph, the remaining \(\delta \)-community graph of user \(u_i\) complies with the required privacy criteria. If it does not, we try to remove edges from the initial blocked candidates-set and insert them back into \(u_i's\) \(\delta \)-community graph, until the remaining community graph complies with the required criteria, or until we have tested all the edges in the initial candidates-set without finding a set of edges to be blocked. We propose three methods for selecting the edge to be removed from the initial candidates-set and inserted back into the \(\delta \)-community graph:

  1. Randomize: select an edge randomly.

  2. Maximum PIF: select the edge with the maximum probability of information flow.

  3. Minimum PIF: select the edge with the minimum probability of information flow.

Algorithm 6 implements the three methods and Algorithm 7 tests the criteria.
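Under the same assumptions as before (the pif() sketch from Sect. 3; f_orig holding the original PIF values; the function name is ours), the following sketch combines the spirit of Algorithms 6 and 7: it removes the whole initial candidates-set, then tentatively unblocks edges in the order given by one of the three methods, keeping an edge unblocked only if the adversary bound (Eq. 4) still holds, and stopping once the friends bound (Eq. 2) is met. Eq. 3 is enforced structurally, since cut candidates never touch the \(\beta \)-community super-vertex.

```python
def final_candidates(G, cut, u_i, friends, adversaries, f_orig,
                     alpha, gamma, key=None):
    """Sketch of Algorithms 6/7. `key` orders the unblocking attempts:
    None keeps the cut's own (random) order; otherwise sort by PIF,
    ascending for Minimum PIF or descending for Maximum PIF."""
    H = G.copy()
    H.remove_edges_from(cut)
    blocked = sorted(cut, key=key) if key is not None else list(cut)

    def friends_ok(f):    # Eq. 2
        return all(f.get(v, 0.0) >= alpha * f_orig[v] for v in friends)

    def adversaries_ok(f):  # Eq. 4
        return all(f.get(a, 0.0) <= gamma * f_orig[a] for a in adversaries)

    for e in list(blocked):
        if friends_ok(pif(H, u_i)):
            return blocked                      # criteria already satisfied
        H.add_edge(*e, **G.get_edge_data(*e))   # tentatively unblock e
        if adversaries_ok(pif(H, u_i)):
            blocked.remove(e)                   # unblocking e is safe
        else:
            H.remove_edge(*e)                   # re-block: e leaks
    return blocked if friends_ok(pif(H, u_i)) else None
```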

Algorithm 7. Test the privacy criteria

5 Evaluation

In this section we describe the evaluation method we use for the proposed algorithm, and the results we obtained using real data [10]. We first demonstrate our methods and the difference between them using a toy community.

5.1 Demonstration on a Synthetic Community

We demonstrate our algorithms on a small graph representing a synthetic community that we built from the example in [2], containing 11 vertices and 23 edges. We selected community distance \(\delta = 3\) and close friends distance \(\beta = 1\), and assigned two adversaries. The algorithms were tested with different probabilities of information flow from the source user \(u_0\) to the community members. In the following example, Fig. 6 describes the synthetic community graph with high probabilities of information flow on the edges to the adversaries. This situation simulates a collision, in which it is hard to select \(\alpha \) and \(\gamma \) such that we get minimum leakage of information flow to \(u_0's\) adversaries and maximum information flow to \(u_0's\) community.

Fig. 6. Synthetic community graph with collision (Color figure online)

In this community graph \(u_0\) is the source; \(u_0\) has four close friends: 1, 2, 3, 4; four acquaintances: 5, 6, 7, 8; and two adversaries: 9, 10.

Each adversary has three incoming edges.

{(6, 9), (5, 9), (8, 9) } with probabilities (0.19, 0.95, 0.8) respectively.

{(5, 10), (7, 10), (8, 10) } with probabilities (1, 0.85, 0.95) respectively.

The maximum probability of information flow from \(u_0\) to the members of his community graph is depicted in Table 1.

Table 1. PIF from \(u_0\) to his community

Next, using this example, we show why the contract approach has a better chance of finding a good set of edges that can be blocked while satisfying the privacy criteria.

Table 2. Min-Cut candidates
Table 3. Contract candidates

Block edges by Min-Cut method. The minimum cut found by the Min-Cut method is depicted in Table 2. If we remove the initial candidates-set edges from \(u_0's\) community graph, the probability of information flow to vertices 7 and 8 will be 0, meaning no flow at all. In the final step of Algorithm 6, we try unblocking each edge from the initial candidates-set in order to reach the required privacy criteria, which are computed by Algorithm 7; in this example the only edge that improves the PIF to the community without increasing the information leakage to \(u_0's\) adversaries is (3, 7), thus the final candidates-set is {(3, 7)}. In this example we cannot define values of \(\alpha \) and \(\gamma \) that comply with the required privacy criteria.

Block edges by Contract method. A cut found by one iteration of the Contract method is depicted in Table 3.

If we remove the initial candidates-set edges, the probability of information flow to vertices 5, 6, and 8 will be 0, meaning no flow at all. Algorithm 6 tries unblocking each edge from the initial candidates-set in order to reach the required privacy criteria, which are computed by Algorithm 7; here the final candidates-set is empty, since each edge we unblock not only improves the information flow to \(u_0's\) community, but also increases the information leakage to \(u_0's\) adversaries.

It is evident that when the edges to the adversaries have high probabilities, the max-flow-min-cut method might not select those edges and thus might not find a solution that complies with the required privacy criteria, while the contract method might find the trivial cut that contains only the edges to the adversaries, and thereby comply with the required privacy criteria.

5.2 Test on SNAP Database

We evaluated our algorithms on the Facebook network data from the Stanford Large Network Data-set Collection [10]. The SNAP library has been actively developed since 2004 and is organically growing as a result of Stanford research pursuits in the analysis of large social and information networks; the website was launched in July 2009. The social network graph describes the social circles from Facebook (anonymized) and consists of 4,039 nodes (users) and 88,234 edges. We took the structure and relationships from the SNAP database, and assigned random probabilities to the edges in the network graph in the following way. We defined four types of users, where the type reflects the user's willingness to share information: very high sharing, medium, sometimes, and very low. Each user in the graph was randomly assigned a type. To make the edge probabilities consistent with the user types, we randomly assigned probabilities to each user's outgoing edges according to his type, from the following ranges: very high sharing users (probability 0.75–1), medium (0.5–0.75), sometimes (0.25–0.5), very low (0–0.25). The four types were distributed uniformly among all the network users.
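A sketch of this assignment (the type names and probability ranges follow the text; loading the SNAP edge list into a networkx graph is assumed to have happened beforehand):

```python
import random
import networkx as nx

TYPE_RANGES = {                  # willingness to share -> probability range
    'very_high': (0.75, 1.00),
    'medium':    (0.50, 0.75),
    'sometimes': (0.25, 0.50),
    'very_low':  (0.00, 0.25),
}

def assign_probabilities(G: nx.DiGraph, seed=None):
    """Uniformly assign a sharing type to every user, then draw each edge's
    probability from the range of its source user's type."""
    rng = random.Random(seed)
    types = list(TYPE_RANGES)
    for u in G.nodes():
        G.nodes[u]['type'] = rng.choice(types)
    for u, v in G.edges():
        lo, hi = TYPE_RANGES[G.nodes[u]['type']]
        G[u][v]['p'] = rng.uniform(lo, hi)
```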

Preliminary Results. Tables 4 and 5 summarize the results of four different evaluation runs on different communities.

Table 4. Data size

Table 4 presents four runs with four different sub-communities. The community size is derived from the user selected as the sharing user; the Friends column refers to the number of first-degree friends. Table 5 presents the results obtained by the four runs. Columns 2–3 and 4–5 present the initial and final sets of edges to be blocked, as found by the min-cut and contract algorithms respectively. Columns 6–8 present the threshold parameters used for the run. The difference between the two algorithms is the method for finding the initial candidates-set, min-cut versus contract; both algorithms use the same method for computing the privacy criteria. For each community graph we ran the algorithms with extreme thresholds (\(\alpha =0, \beta =1, \gamma =1\), and \(\alpha =1, \beta =1, \gamma =0\)) and with random thresholds. The remark column indicates which edges were found as candidates for blocking. We can see that in the simple cases (e.g., runs 1 and 2) the solution is trivial and the blocked edges are the edges to the adversaries. While both algorithms are complete, in the non-trivial cases min-cut finds the best solution with respect to blocking adversaries, while contract may return a compromise solution that is less effective in blocking adversaries but allows more sharing with friends. However, the time performance of contract is much better.

Table 5. Evaluation runs results

It is important to note that the contract algorithm, if executed multiple times, is guaranteed to eventually find the optimal solution with respect to the threshold criteria. In the case where no such solution exists, the contract algorithm provides the best cut it found with respect to the thresholds.

6 Conclusion

The problem of uncontrolled information flow in social networks is a genuine concern for one's privacy. In this paper we address the need to follow the social trend of information sharing while enabling owners to prevent their information from flowing to undesired recipients. The goal of the suggested method is to find the minimal set of edges that should be excluded from one's community graph so as to allow sharing of information while blocking adversaries. To reduce the side effect of limiting legitimate information flow, we minimize this impact according to the flow probabilities. Our algorithms can be used within the ORIGIN CONTROL access control model [6], in which every piece of information is associated with its creator forever. The set of cut edges found by our algorithms is stored for each user and can be checked when the origin-controlled information is accessed. This way the administrator can check, whenever this information is accessed by a certain user, whether the edge between that user and the originator was cut. In future work, we intend to expand the evaluation and test our algorithms on different types of social networks (e.g., Twitter). We also intend to explore further approaches to identifying the edges to be blocked, such as genetic algorithms.