
1 Introduction

With the proliferation of the web and online social networks, social-affiliation networks – social/friendship networks where users have many binary attributes or affiliations – have become increasingly common. Examples include social networking sites such as Facebook and Google+, which record user engagement, e.g., pages liked (attributes are pages – yes if liked, no if not); media-sharing social platforms such as Flickr and YouTube, where users can form groups based on their interests (attributes are groups – yes if member, no if not); and location-based social networks like Gowalla, where users can check in at a location they physically visit (attributes are locations – yes if visited, no if not).

We consider two closely related research questions concerning these networks: [RQ1] What rules (patterns) do the various structural properties of social-affiliation graphs – e.g., edge or triangle count – follow in relation to their attributes? [RQ2] How can we synthetically generate realistic networks which provably satisfy these patterns? These questions fall under the umbrella of pattern analysis and modeling, a well-explored research area and a standard practice in understanding real-world graphs [6, 16, 17, 19]. Our interest in these research questions stems in part from the scientific and practical impact that prior work on pattern analysis and modeling has had. The discoveries of the scale-free property (skewed degree distributions [10]) and the small-world property (small graph diameters [28]), together with their respective preferential-attachment [4] and forest-fire [19] models, for instance, have had numerous applications in graph algorithm design, anomaly detection, graph sampling and more [3, 18].

While works on patterns and models for non-attributed graphs abound in the literature, studies dealing with social-affiliation networks are somewhat limited [14, 29] (see Sect. 2). Our work complements these by discovering rules which the structural properties of social-affiliation graphs follow in relation to their attributes. Specifically, we study “attribute-induced subgraphs” (AIS, in short) – each of which is a subgraph induced by the nodes affiliated to a given attribute – substructures which connect the structure of the friendship graph to the distribution of attribute values. See Sect. 3 for more details and Fig. 1 for an example. Studying the patterns exhibited by the structural properties of AISs allows us to understand homophily effects (‘birds of a feather flock together’) and consider questions of the form ‘If the number of users affiliated to attribute \(a\) doubles, what happens to the number of friendships between them?’ As we show later, the patterns discovered based on AISs and the associated capability to answer ‘what-if’ questions are subsequently useful in (i) detecting anomalies and (ii) developing and testing a realistic model for social-affiliation graphs.

Our contributions are two-fold: (a) Patterns: We study four large real-world social-affiliation graphs and discover three new consistent patterns concerning the structural properties of attribute-induced subgraphs. With the help of a case study, we illustrate how the findings can be leveraged for anomaly detection. (b) Model: We propose the SOAR model to produce synthetic social-affiliation graphs provably matching all observed patterns. SOAR is based on self-similarity, implicitly incorporates attribute correlations, scales linearly with graph size and is up to \(50\times \) faster than the prior models for social-affiliation graphs.

Reproducibility. We use publicly-available datasets and open-source our code at www.github.com/dhivyaeswaran/soar.

2 Related Work

We group related work into three categories: models for social networks with no attributes [A] and those for social-affiliation graphs when attributes are given [B] and not given [C].

Table 1. Comparison with other models for social-affiliation graphs

[A] Social graphs with no attributes. Several outstanding network models have been proposed to explain the observed structural characteristics of real-world non-attributed networks. Notable examples include the Barabási-Albert model for heavy-tailed degree distributions [4], the Forest Fire model for shrinking diameter [19], the Butterfly model for the evolution of the giant connected component [20], the Kronecker model for community structure [18] and the Random Typing Graph model for self-similar temporal evolution [2]. Excellent surveys are given in [6, 13, 22]. However, it is not clear how these models could be extended to produce attributes, given the complex interplay between attributes and friendship structure [9, 11, 25].

[B] Social-affiliation graphs when attributes are given. The problem of modeling network structure in the presence of known nodal attributes has been studied. Notably, the Multiplicative Attribute Graph (MAG) model [15] connects nodes according to user-specified attribute-based link affinities. The Attributed Graph Model (AGM) [24] presents a generic approach using an accept-reject sampling framework to augment a given non-attributed network model with correlated attributes. Both MAG and AGM apply to settings with categorical (not just binary) nodal attributes; however, they scale poorly with the number of attributes: each edge is sampled with probability roughly proportional to the dot product of nodal attribute vectors, which is an expensive operation, considering that the social-affiliation graph datasets we study have around 30K to 1.28M affiliations.

[C] Social-affiliation graphs when attributes are not given. The simultaneous generation of attributes and friendships, in the context of social-affiliation graphs (i.e., with many binary attributes), has received some attention. The pioneering work by [29] discovers several patterns in social-affiliation graphs (e.g., a power law relation between the number of friends and the average count of affiliations). It proposes the Zhel model by adapting the non-attributed microscopic graph evolutionary model [17] to this setting. [14] studies the evolution of the directed social network of Google+ and its affiliations, focusing on the density, diameter, degrees and clustering coefficients of users and affiliations. It proposes the SAN model, augmenting [17] with attribute-augmented preferential attachment and triangle-closing mechanisms to replicate the observations on Google+. The patterns we discover in this paper are complementary to the above discoveries. Further, both Zhel and SAN model the evolution of social-affiliation graphs by generating the attributes and edges of one node at a time; in contrast, we investigate a one-shot approach to graph generation (i.e., without modeling its evolution), which leads to input parsimony and a \(\sim \)50\(\times \) speed-up (see Sect. 5).

A qualitative comparison of social-affiliation graph models is given in Table 1.

Table 2. Frequently used symbols and their meanings
Fig. 1. (a) A social-affiliation graph with isSquare, isStriped attributes and (b) the subgraph induced by isSquare attribute

3 Preliminaries

Notation. Let \(\mathcal {G}= (\mathcal {V}, \mathcal {E}, \mathcal {A}, \mathcal {M})\) be a social-affiliation graph, where \(\mathcal {V}\) is the set of nodes (users), \(\mathcal {A}\) is the set of binary attributes (affiliations), \(\mathcal {E}\) is the set of unweighted undirected who-is-friends-with-whom edges among nodes and \(\mathcal {M}\) is the set of who-is-affiliated-to-what attribute memberships between nodes and attributes. That is, if node \(u\) is connected to node \(u'\), then, \(\mathcal {E}\) includes edges \((u, u')\) and \((u', u)\); similarly, \((u, a) \in \mathcal {M}\) iff node \(u\) is affiliated with attribute \(a\). \(\mathcal {G}\) is equivalently expressed as a tuple \((\mathbf {A},\mathbf {F})\) of the \(n\times n\) symmetric adjacency matrix \(\mathbf {A}\) and the \(n\times k\) membership matrix \(\mathbf {F}\), where \(n= \left| \mathcal {V}\right| \) and \(k= \left| \mathcal {A}\right| \) denote the number of nodes and attributes respectively. The matrices are binary with 1 indicating the presence of an edge (in \(\mathbf {A}\)) or an attribute membership (in \(\mathbf {F}\)). Table 2 gives the frequently used notation.

Attribute-Induced Subgraph (AIS). Given a social-affiliation graph \(\mathcal {G}=(\mathcal {V},\mathcal {E},\mathcal {A},\mathcal {M})\), the attribute-induced subgraph \(\mathcal {G}_{a}\) corresponding to a given attribute \(a\in \mathcal {A}\) is obtained by selecting the nodes affiliated to attribute \(a\) and the edges which link two such nodes. Formally, \(\mathcal {G}_{a}= (\mathcal {V}_{a},\mathcal {E}_{a})\) where \(\mathcal {V}_{a}= \{u\in \mathcal {V}\mid (u,a) \in \mathcal {M}\}\) and \(\mathcal {E}_{a}= \{(u, u') \in \mathcal {E}\mid u, u'\in \mathcal {V}_a\}\). Let \(n_{a}= \left| \mathcal {V}_{a}\right| \) and \(m_{a}= \left| \mathcal {E}_{a}\right| \) denote its number of nodes and edges respectively. Triangle count \(\varDelta _{a}\) is the number of triangles in \(\mathcal {G}_{a}\) while spectral radius \(\sigma _{a}\) is the largest eigenvalue of its adjacency matrix. An example of an AIS is given in Fig. 1.
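The four AIS quantities defined above can be computed directly from the matrix form \((\mathbf {A},\mathbf {F})\). The following is a minimal dense NumPy sketch for small graphs, not the authors' code; the function name `ais_properties` and the toy graph are illustrative assumptions, and it assumes a symmetric 0/1 adjacency matrix without self-loops.

```python
import numpy as np

def ais_properties(A, F, a):
    """Return (n_a, m_a, triangle count, spectral radius) for attribute column a."""
    members = np.flatnonzero(F[:, a])        # nodes affiliated with attribute a
    A_a = A[np.ix_(members, members)]        # adjacency matrix of the induced subgraph
    n_a = len(members)
    m_a = int(A_a.sum()) // 2                # each undirected edge is stored twice
    # trace(A^3) counts closed walks of length 3, i.e. 6 per triangle
    triangles = int(round(np.trace(A_a @ A_a @ A_a))) // 6
    sigma_a = float(np.linalg.eigvalsh(A_a)[-1]) if n_a else 0.0
    return n_a, m_a, triangles, sigma_a

# Toy graph (assumed for illustration): triangle 0-1-2 plus node 3 attached to node 2.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
# Attribute 0 is held by nodes {0, 1, 2}; attribute 1 by nodes {1, 3}.
F = np.array([[1, 0],
              [1, 1],
              [1, 0],
              [0, 1]])
```

For graphs of the size studied here, sparse matrices and approximate triangle and eigenvalue routines would be needed; the dense version is only meant to make the definitions concrete.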

Datasets. We study four large publicly-available datasets, each of which contains a social network formed by friendship (or family) relations and also side-information regarding affiliations of users. Based on the nature of affiliations, we describe the datasets in two categories: (i) Online-affiliation networks: In Flickr [21] and YouTube [23], online photo-sharing and video-sharing websites respectively, users are allowed to form groups based on their common interests. We consider each group as a binary attribute, i.e., a user u has a group g if she participates in it. The friendship networks in these datasets are directed, but they have high link symmetry (edge reciprocity) [21]. Hence, for simplicity, we drop the direction of edges and retain a single copy of each resulting edge to get an undirected graph without multi-edges. (ii) Offline-affiliation networks: The Brightkite and Gowalla datasets [8] contain an undirected friendship network along with user check-in information, i.e., who visited where and when. We use each location as a binary attribute; a user u has a location attribute l if she has visited l at least once. For a detailed description of these datasets, we refer readers to the papers cited above. Some useful statistics are provided in Table 3. The next section details our pattern discoveries on these datasets.

Table 3. Social-affiliation graph datasets studied

4 Pattern Discoveries

Given an attribute-induced subgraph \(\mathcal {G}_{a}= (\mathcal {V}_{a},\mathcal {E}_{a})\), there is an infinite set of graph properties that one could investigate to look for patterns (number of nodes/edges, degree distributions, one or more eigenvalues, core number, etc.). Which ones should we focus on? Intuitively, we want to study properties that are (i) fundamental, easy to understand and interpret, (ii) fast to compute, exactly or approximately, in near-linear time in the number of edges and (iii) lead to prevalent patterns that AISs obey consistently across different datasets. After extensive experiments, we shortlist the following four properties of attribute-induced subgraphs: (i) \(n_{a}= \left| \mathcal {V}_{a}\right| \): number of nodes in \(\mathcal {G}_{a}\), i.e., number of users affiliated with attribute \(a\). (ii) \(m_{a}= \left| \mathcal {E}_{a}\right| \): number of edges in \(\mathcal {G}_{a}\), i.e., number of friendships among users affiliated with attribute \(a\). (iii) \(\varDelta _{a}\): number of triangles in \(\mathcal {G}_{a}\), typically indicative of the extent to which nodes in \(\mathcal {G}_{a}\) tend to cluster together (e.g., via clustering coefficient). (iv) \(\sigma _{a}\): spectral radius, or the principal eigenvalue of adjacency matrix of \(\mathcal {G}_{a}\), roughly indicative of how large and how dense the giant connected component in \(\mathcal {G}_{a}\) is. We list our observations regarding these properties in Sect. 4.1 and postpone explanations to Sect. 4.2.

4.1 Observations

Following standard terminology, we say that variables x and y obey a power law with exponent c, if \(y \propto x^c\) [1]. Our pattern discoveries are all power laws with non-negative (and usually non-integer) exponents, as stated below.

Observation 1

([P1] Edge count vs. node count). Edge count \(m_{a}\) and node count \(n_{a}\) of AISs obey a power law: \(m_{a}\propto n_{a}^\alpha ,\ 0 \le \alpha \le 2.\)

In the datasets we studied, \(\alpha \in [1.17,1.51]\). That is, doubling the number of nodes in an AIS more than doubles (and up to roughly triples) its number of edges.

Observation 2

([P2] Triangle count vs. node count). Triangle count \(\varDelta _{a}\) and node count \(n_{a}\) of AISs obey a power law: \( \varDelta _{a}\propto n_{a}^\beta , \ 0 \le \beta \le 3.\)

In the datasets we studied, \(\beta \in [1.24,1.96]\). That is, as the number of nodes in an AIS doubles, its triangle count becomes roughly 2.4 to 3.9 times larger.

Observation 3

([P3] Spectral radius vs. triangle count). Spectral radius \(\sigma _{a}\) and triangle count \(\varDelta _{a}\) of AISs obey a power law: \(\sigma _{a}\propto \varDelta _{a}^\gamma , \ \gamma \ge 0.\)

In the datasets we studied, \(\gamma \in [0.31,0.33]\). That is, a doubling of the spectral radius of an AIS corresponds to a roughly eight-fold increase in its number of triangles.
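As a quick arithmetic check of these readings (a worked illustration using only the exponent ranges reported above): if \(y \propto x^c\), doubling \(x\) multiplies \(y\) by \(2^c\), so

$$\begin{aligned} \text {[P1]}\quad & 2^{1.17} \approx 2.25, \quad 2^{1.51} \approx 2.85 \\ \text {[P2]}\quad & 2^{1.24} \approx 2.36, \quad 2^{1.96} \approx 3.89 \\ \text {[P3]}\quad & 2^{1/0.33} \approx 8.2, \quad 2^{1/0.31} \approx 9.4 \end{aligned}$$

For [P3] the relation is inverted: \(\sigma _{a}\propto \varDelta _{a}^\gamma \) implies \(\varDelta _{a}\propto \sigma _{a}^{1/\gamma }\), so doubling the spectral radius multiplies the triangle count by \(2^{1/\gamma }\).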

Figure 2, which plots the relevant quantities (\(m_{a}\) vs. \(n_{a}\), \(\varDelta _{a}\) vs. \(n_{a}\) and \(\sigma _{a}\) vs. \(\varDelta _{a}\)), illustrates these observations. The cloud of gray points in these figures shows the values corresponding to various AISs, and darker areas signify regions of higher density. The relevant exponents \(\alpha , \beta , \gamma \) are computed following standard practice (e.g., as in [16]): we bucketize the x-axis logarithmically and compute per-bucket y averages (black triangles). The slope of the black line, which is the least-squares fit to the black triangles, gives the exponent. In addition, we report the Pearson correlation coefficient \(\rho \) of the per-bucket averages as a proxy for the goodness-of-fit of the power law relation. For the increasing relations considered here, this value lies in [0, 1] and, intuitively, the higher the value, the better the fit. In our experiments, \(\rho \) was consistently above 0.95, suggesting a near-perfect fit.
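The fitting procedure just described can be sketched as follows. This is a minimal NumPy illustration rather than the authors' script: the function name `fit_power_law`, the default of 20 buckets and the use of the per-bucket mean of x as the bucket position are assumptions not stated in the text.

```python
import numpy as np

def fit_power_law(x, y, n_buckets=20):
    """Estimate c in y ∝ x^c from log-bucketed averages; also return Pearson rho."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    keep = (x > 0) & (y > 0)                       # log scale needs positive values
    x, y = x[keep], y[keep]
    # Bucket boundaries equally spaced on a logarithmic x-axis
    edges = np.logspace(np.log10(x.min()), np.log10(x.max()), n_buckets + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_buckets - 1)
    bx, by = [], []
    for b in range(n_buckets):
        mask = idx == b
        if mask.any():                             # skip empty buckets
            bx.append(np.log10(x[mask].mean()))    # assumed bucket position
            by.append(np.log10(y[mask].mean()))    # per-bucket y average
    slope, _intercept = np.polyfit(bx, by, 1)      # least-squares line in log-log space
    rho = np.corrcoef(bx, by)[0, 1]                # Pearson correlation of the averages
    return slope, rho
```

Calling `fit_power_law(n_a_values, m_a_values)` on per-attribute node and edge counts would return an estimate of \(\alpha \) together with \(\rho \); the same call with triangle counts or spectral radii gives \(\beta \) and \(\gamma \).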

Fig. 2. Patterns exhibited by attribute-induced subgraphs (each point is an AIS)

4.2 Explanations, Use in Anomaly Detection, and Discussion

Here, we attempt to explain our observations in terms of known/expected properties of social-affiliation networks and hypothesize the nature of the anomalies that a deviation from each pattern above would give rise to.

[P1] Edge count vs. node count. As the number of nodes in an AIS doubles, the number of edges remains the same (\(\alpha =0\)) for empty social-affiliation graphs having no edges and quadruples (\(\alpha =2\)) for complete graphs. As real-world social-affiliation networks tend to be sparse (\(\left| \mathcal {E}\right| = \mathcal {O}\left( \left| \mathcal {V}\right| \right) \)), one might expect the exponent \(\alpha \) to be roughly 1. However, in experiments, \(\alpha \) was much higher, e.g., \(\sim \)1.5 for the Flickr dataset. This suggests homophily, i.e., more friendships among people sharing the same attributes, which causes the number of edges to more than double (and, for Flickr, nearly triple) when the number of nodes is doubled. Attribute-induced subgraphs violating this pattern can be understood as unusually sparse or dense, having too few or too many friendships between users sharing an attribute, e.g., when no two people who go to Starbucks are friends with each other.

[P2] Triangle count vs. node count. As the number of nodes in an AIS doubles, triangle count remains the same (\(\beta =0\)) for empty or tree/star-like graphs with no triangles and becomes eight times as large (\(\beta =3\)) for fully connected graphs. In experiments, \(\beta \) was between 1 and 2; that is, the triangle count becomes 2–4 times as large when the node count doubles. This suggests that the AISs are neither stars nor cliques (as might ideally be expected based on homophily) but somewhere in between – consisting of several small stars, cliques and also possibly isolated nodes. Violations of this pattern can be understood as unusually non-clustered attribute-induced subgraphs (triangle-free, e.g., trees) or unusually clustered graphs (cliques). For example, it is suspicious if all users who visit ‘ShadySide’ are friends with each other.

Fig. 3. Eigenvalues of 5 AISs with highest node counts from YouTube dataset

[P3] Spectral radius vs. triangle count. We know that the number of triangles in a graph is proportional to the sum of cubes of the eigenvalues of its adjacency matrix (precisely, one sixth of this sum) [12]. Based on this, we provide two sufficient conditions for the observed slope of \(\gamma \approx 1/3\). Condition 1 (Dominating first eigenvalue): the first eigenvalue is much bigger than the rest; hence, the triangle counts of AISs are approximately proportional to the cubes of their respective spectral radii (roughly, the number of triangles in their giant connected components, GCCs). Condition 2 (Power law eigenvalues): Lemma 1 provides an alternate explanation assuming that the exponents of the eigenvalue power law distributions of all AISs are identical. Diving deeper into the eigenvalue vs. rank plots of AISs (see Fig. 3) reveals skewed eigenvalue distributions with similar slopes – suggesting that both reasons above are at play. Violations are due to attribute-induced subgraphs having unusually small, sparse or dense GCCs.

Lemma 1

(Spectral radius-triangle count power law). If \(s \) is the common exponent of power law eigenvalue distributions of the attribute-induced subgraphs for a given social-affiliation graph, their triangle counts \(\varDelta _{a}\) and spectral radii \(\sigma _{a}\) approximately obey, up to a constant factor, \(\varDelta _{a}= \sigma _{a}^3\ \zeta (3s)\) where \(\zeta (\cdot )\) is the Riemann zeta function [27].

Proof

As the eigenvalues of the adjacency matrices of all AISs follow a power law with exponent s, the \(i^{th}\) eigenvalue of any AIS is \(\sigma _{a}i^{-s}\), where \(\sigma _{a}\) is its spectral radius. Hence, the triangle count \(\varDelta _{a}\), which is (up to a constant factor) the sum of cubes of the eigenvalues of the adjacency matrix, is given by \(\sum _i (\sigma _{a}i^{-s})^3 \approx \sigma _{a}^3\sum _{i=1}^{\infty }i^{-3s} = \sigma _{a}^3\ \zeta (3s)\), as desired.   \(\square \)
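The series in the proof can be checked numerically. The sketch below is only an illustration under the lemma's idealized assumption that the \(i^{th}\) eigenvalue equals \(\sigma _{a}i^{-s}\) for \(i \ge 1\); the function name `cubed_eigenvalue_ratio`, the truncation length and the choice \(s=1\) are assumptions made for the example.

```python
def cubed_eigenvalue_ratio(s, n_eigs=100_000):
    """Return sum_{i=1..n_eigs} i^(-3s), i.e. (sum of cubed eigenvalues) / sigma^3
    when the i-th eigenvalue is sigma * i^(-s)."""
    return sum(i ** (-3.0 * s) for i in range(1, n_eigs + 1))

ZETA_3 = 1.2020569031595942       # known value of the Riemann zeta function at 3
ratio = cubed_eigenvalue_ratio(s=1.0)
```

With \(s=1\) the truncated sum agrees with \(\zeta (3)\) to many decimal places, and because the ratio does not depend on \(\sigma _{a}\), the triangle count scales as \(\sigma _{a}^3\), which corresponds to \(\gamma = 1/3\).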

Anomaly Detection. Our pattern discoveries represent normal behavior of attributes in a social-affiliation graph, deviations from which can be flagged as anomalies. For example, the spectral radius vs. triangle count plot for YouTube yields a dense cloud of points mostly distributed along a straight line in log-log scales (Fig. 4a); the red triangle marks an exception due to an anomalous attribute. It turns out that, as expected, the deviation was due to its unusually sparse GCC, which consisted of a giant star plus a few triangles (see Fig. 4b for its Gephi visualization [5]). In contrast, a typical AIS with a comparable triangle count (green triangle in Fig. 4a) has a denser GCC (Fig. 4c).
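One way to turn "deviation from the fitted line" into an automatic flag is to threshold log-log residuals. The text does not specify a scoring rule, so everything in this sketch is an assumption: the function name `flag_anomalies`, the robust scale estimate based on the median absolute deviation, and the threshold `k=3`.

```python
import numpy as np

def flag_anomalies(x, y, k=3.0):
    """Return indices of points whose log-log residual from the fitted line
    exceeds k robust standard deviations (assumed rule, not from the paper)."""
    lx, ly = np.log10(x), np.log10(y)
    slope, intercept = np.polyfit(lx, ly, 1)       # power-law fit in log-log space
    resid = ly - (slope * lx + intercept)
    centered = np.abs(resid - np.median(resid))
    scale = 1.4826 * np.median(centered)           # MAD rescaled to a std-dev estimate
    return np.flatnonzero(centered > k * scale)

# Synthetic example: sigma ∝ triangles^(1/3) with mild deterministic jitter
# and one planted outlier at index 20.
triangles = 10.0 ** np.linspace(1, 5, 41)
sigma = triangles ** (1 / 3) * (1 + 0.02 * np.sin(np.arange(41)))
sigma[20] *= 5
flagged = flag_anomalies(triangles, sigma)
```

On this synthetic input only the planted point is flagged. On real data, flagged attributes would still need manual inspection, as in the case study above.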

Fig. 4. Anomaly detection using pattern [P3] reveals an attribute-induced subgraph (AIS) with an unusually sparse giant connected component (GCC) (Color figure online)

Discussion. It is natural to suppose that the data scraping methodology (sampling size/strategy) would have a considerable impact on the pattern discoveries. However, the consistency of our observations across datasets sampled in various ways – multiple sizes (Gowalla and Brightkite – almost whole public data; Flickr, YouTube – large fraction of the giant weakly connected component [8, 21]) and strategies (no sampling, snowball sampling using forward and/or reverse links depending on the public API) – suggests that the patterns indeed generalize across many reasonable data scraping mechanisms. Also, note that our study is limited to the case of binary attributes; similar explorations of categorical and real-valued attributes are possible but left to future work.

5 SOAR Model

In this section, we show how to generate graphs which provably obey the discovered patterns using a coupled version of the matrix Kronecker product [26]. The resulting model, called SOAR – short for SOcial-Affiliation graphs via Recursion – has two steps: (i) an initiator graph \(\mathcal {G}_1\), consisting of carefully coupled initiator matrices \(\mathbf {A}_{1}\) for adjacency and \(\mathbf {F}_{1}\) for membership, is chosen; (ii) the initiator graph is recursively multiplied with itself via the Coupled Kronecker Product (Definition 2) for a desired number of steps to obtain the final social-affiliation graph. Sect. 5.1 presents the SOAR model in detail. Our important contribution here is the proof that the Coupled Kronecker Product is a pattern-preserving operation, i.e., if the initiator graph obeys patterns P1–P3, so does the final graph (see Sect. 5.2).

5.1 Proposed SOAR Model

Recall from Sect. 3 that \(\mathcal {G}\) is a tuple \((\mathbf {A},\mathbf {F})\) of the \(n\times n\) symmetric adjacency matrix \(\mathbf {A}\) and the \(n\times k\) membership matrix \(\mathbf {F}\), where \(n=\left| \mathcal {V}\right| \) and \(k=\left| \mathcal {A}\right| \) denote the number of nodes and attributes respectively. Given an initiator social-affiliation graph \(\mathcal {G}_1 = (\mathbf {A}_{1}, \mathbf {F}_{1})\), where \(\mathbf {A}_{1}\) is the \(n_1\times n_1\) symmetric initiator matrix for adjacency and \(\mathbf {F}_{1}\) is the \(n_1\times k_1\) initiator matrix for membership, we propose to derive the final social-affiliation graph \(\mathcal {G}=(\mathbf {A},\mathbf {F})\) via the recursive equation:

$$\begin{aligned} \mathcal {G}_{t+1} = \mathcal {G}_t \ \bar{\otimes }\ \mathcal {G}_1 \end{aligned} \tag{1}$$

where \(\ \bar{\otimes }\ \) is the Coupled Kronecker Product, as defined below:

Definition 2

(Coupled Kronecker Product (CKP)). Given social-affiliation graphs \(\mathcal {G}_1=(\mathbf {A}_1, \mathbf {F}_1)\) and \(\mathcal {G}_2 = (\mathbf {A}_2, \mathbf {F}_2)\), their Coupled Kronecker Product is given by

$$\begin{aligned} \mathcal {G}_1 \ \bar{\otimes }\ \mathcal {G}_2 = (\mathbf {A}_1\otimes \mathbf {A}_2, \mathbf {F}_1\otimes \mathbf {F}_2) \end{aligned} \tag{2}$$

where \(\otimes \) is the matrix Kronecker product.

After \(M \) steps of Eq. (1), we obtain a \(n\times n\)-dim \(\mathbf {A}_{M}\) and a \(n\times k\)-dim \(\mathbf {F}_{M}\) where \(n=n_1^M \) and \(k=k_1^M \) respectively. When the initiator matrices are binary, so are the final matrices and thus can be directly used as the adjacency \(\mathbf {A}\) and membership \(\mathbf {F}\) matrices, respectively. It turns out that the above process captures the required power laws but has several discrete jumps (fluctuations). Hence, we use the stochastic version below.
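The deterministic recursion of Eqs. (1) and (2) maps directly onto NumPy's `np.kron`. The sketch below is illustrative only: the function names and the small binary initiators are hypothetical, and the loop runs \(M-1\) products so that the result has \(n_1^M \) nodes and \(k_1^M \) attributes, which is our reading of the sizes stated above.

```python
import numpy as np

def coupled_kronecker(g1, g2):
    """Coupled Kronecker Product of two graphs given as (A, F) pairs, as in Eq. (2)."""
    (A1, F1), (A2, F2) = g1, g2
    return np.kron(A1, A2), np.kron(F1, F2)

def soar_deterministic(A1, F1, M):
    """Apply Eq. (1) starting from the initiator and return (A_M, F_M)."""
    A, F = A1, F1
    for _ in range(M - 1):                 # G_1 is the initiator itself
        A, F = coupled_kronecker((A, F), (A1, F1))
    return A, F

# Hypothetical binary initiators: 3 nodes, 2 attributes.
A1 = np.array([[1, 1, 0],
               [1, 1, 1],
               [0, 1, 1]])
F1 = np.array([[1, 0],
               [1, 1],
               [0, 1]])
A3, F3 = soar_deterministic(A1, F1, M=3)   # 27 x 27 adjacency, 27 x 8 membership
```

Because the Kronecker product of symmetric matrices is symmetric and the product of binary entries is binary, `A3` can be used directly as an adjacency matrix.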

The main idea is to produce, at every recursive step, matrices of edge/membership occurrence probabilities instead of discrete (binary) edges/memberships. Thus, we begin with initiator matrices having real-valued entries in [0, 1] (they do not need to sum to 1) and add a small relative noise \(\eta \) to the initiator matrices independently at every recursive step t. This process results in the final dense probability matrices \(\mathbf {A}_{M}\) and \(\mathbf {F}_{M}\), from which we recover \(\mathbf {A}\) and \(\mathbf {F}\) by sampling each entry with probability proportional to its final value. A scalable implementation of the above approach, which samples one edge or membership at a time, is given in Algorithm 1. The Hadamard product \(\odot \) in lines 6 and 8 performs an element-wise matrix multiplication to add the desired noise to the initiators.
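To make the stochastic idea concrete, here is a naive dense sketch that materializes the probability matrices; it is not the scalable Algorithm 1. Several details are assumptions because the text does not fix them: the noise is drawn as entry \(\times (1 + \eta u)\) with \(u\) uniform in \([-1, 1]\), entries are clipped to [0, 1], each entry is sampled as an independent Bernoulli trial, and self-loops are dropped. The name `soar_dense` and the initiator values are hypothetical.

```python
import numpy as np

def soar_dense(A1, F1, M, eta, rng):
    """Naive dense stochastic generator (assumed noise and sampling rules)."""
    A, F = np.ones((1, 1)), np.ones((1, 1))
    for _ in range(M):
        # Relative noise via element-wise (Hadamard) multiplication
        noisy_A = A1 * (1 + eta * rng.uniform(-1, 1, A1.shape))
        noisy_A = (noisy_A + noisy_A.T) / 2          # keep edge probabilities symmetric
        noisy_F = F1 * (1 + eta * rng.uniform(-1, 1, F1.shape))
        A = np.kron(A, np.clip(noisy_A, 0, 1))
        F = np.kron(F, np.clip(noisy_F, 0, 1))
    # Sample each unordered node pair once, then mirror to get a symmetric matrix
    upper = np.triu(rng.random(A.shape) < A, k=1)
    adj = (upper | upper.T).astype(int)
    mem = (rng.random(F.shape) < F).astype(int)
    return adj, mem

# Hypothetical probabilistic initiators with entries in [0, 1].
A1 = np.array([[0.9, 0.6, 0.1],
               [0.6, 0.8, 0.5],
               [0.1, 0.5, 0.9]])
F1 = np.array([[0.9, 0.1],
               [0.7, 0.6],
               [0.1, 0.9]])
adj, mem = soar_dense(A1, F1, M=3, eta=0.1, rng=np.random.default_rng(0))
```

The dense matrices grow as \(n_1^{2M}\) and \(n_1^M k_1^M\), so this form is only practical for small \(M \); that cost is what motivates sampling one edge or membership at a time.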

Algorithm 1

Running Time Analysis. Initialization (ln 1–11) contributes a fixed overhead of \(\mathcal {O}(M (n_1^2 + n_1k_1))\). The generation of edges (ln 12–20) and memberships (ln 21–29) take \(\mathcal {O}\left( n_1^2M \right) \) per edge and \(\mathcal {O}\left( n_1k_1M \right) \) per membership respectively. As \(n_1,k_1\) and \(M \) are small in practice (<10), Algorithm 1 is linear in the number of edges and attribute memberships.
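The per-edge and per-membership costs above are what one gets from a recursive descent through the initiators, a standard way to sample cells of a Kronecker product without materializing it. The sketch below illustrates that technique and is not a transcription of Algorithm 1; the function name `sample_cell` is hypothetical, and handling of duplicates and symmetry is left to the caller.

```python
import numpy as np

def sample_cell(P_levels, rng):
    """Sample one (row, col) of the Kronecker product of the matrices in P_levels,
    with probability proportional to the corresponding product entry."""
    row = col = 0
    for P in P_levels:                        # one initiator-sized choice per level
        # Choosing a cell costs O(rows * cols) work, hence O(n1^2 M) per edge
        flat = rng.choice(P.size, p=(P / P.sum()).ravel())
        i, j = divmod(int(flat), P.shape[1])
        row = row * P.shape[0] + i            # earlier levels are more significant
        col = col * P.shape[1] + j
    return row, col

rng = np.random.default_rng(0)
A1 = np.array([[0.9, 0.6, 0.1],
               [0.6, 0.8, 0.5],
               [0.1, 0.5, 0.9]])             # hypothetical 3 x 3 initiator
F1 = np.array([[0.9, 0.1],
               [0.7, 0.6],
               [0.1, 0.9]])                  # hypothetical 3 x 2 initiator
edge = sample_cell([A1] * 3, rng)            # one edge of a 27-node graph
membership = sample_cell([F1] * 3, rng)      # one (node, attribute) pair
```

Passing a different noisy copy of the initiator at each level would mimic the per-step noise; the total cost is then proportional to the number of edges and memberships drawn.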

5.2 Theoretical Properties

The structural properties of graphs generated using the Kronecker product are well-studied and a number of desirable properties have been proved, e.g., multinomial distribution of degrees and singular values, etc. [18]. These properties directly carry over to the proposed model. More surprisingly, for careful coupling of initiators, SOAR graphs provably obey all the discovered power laws from Sect. 4. This is due to the pattern-preserving property of the Coupled Kronecker Product operation. That is, if graphs \(\mathcal {G}_1\) and \(\mathcal {G}_2\) obey the patterns P1–P3 with the same exponent, then so does their Coupled Kronecker Product \(\mathcal {G}_1\ \bar{\otimes }\ \mathcal {G}_2\). This is stated in Lemmas 3–5 (proofs in appendix).

Lemma 3

(CKP preserves [P1]). If \(\mathcal {G}_1\) and \(\mathcal {G}_2\) obey the edge count vs. node count power law with exponent \(\alpha \), i.e., \(m_{a}\propto n_{a}^\alpha \), so does \(\mathcal {G}_1\ \bar{\otimes }\ \mathcal {G}_2\).

Lemma 4

(CKP preserves [P2]). If \(\mathcal {G}_1\) and \(\mathcal {G}_2\) obey the triangle count vs. node count power law with exponent \(\beta \), i.e., \(\varDelta _{a}\propto n_{a}^\beta \), so does \(\mathcal {G}_1\ \bar{\otimes }\ \mathcal {G}_2\).

Lemma 5

(CKP preserves [P3]). If \(\mathcal {G}_1\) and \(\mathcal {G}_2\) obey the spectral radius vs. triangle count power law with exponent \(\gamma \), i.e., \(\sigma _{a}\propto \varDelta _{a}^\gamma \), so does \(\mathcal {G}_1\ \bar{\otimes }\ \mathcal {G}_2\).

The proofs, given in appendix, use the properties of matrix Kronecker product [26] and two key observations: (1) edge count, node count, triangle count and spectral radius of AIS for an attribute \(a\) are explicit algebraic functions of the adjacency matrix \(\mathbf {A}\) and the column in \(\mathbf {F}\) which corresponds to \(a\); (2) each column in \(\mathbf {F}_1\otimes \mathbf {F}_2\) is the Kronecker product of a column in \(\mathbf {F}_1\) and a column in \(\mathbf {F}_2\). Given this, our main result is:
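The two key observations can be checked numerically on a toy example. The sketch below is an illustration, not the appendix proof: it writes the AIS quantities as functions of \(\mathbf {A}\) and a 0/1 membership column `f`, and verifies that each quantity for the column \(f_1\otimes f_2\) in the product graph is the product of the per-factor quantities. The function name and the toy graphs are assumptions, and adjacency matrices are taken to have no self-loops.

```python
import numpy as np

def ais_quantities(A, f):
    """Return (n_a, 2*m_a, 6*triangles, spectral radius) of the AIS selected by f."""
    B = A * np.outer(f, f)                    # adjacency restricted to affiliated nodes
    return (f.sum(),                          # node count
            f @ A @ f,                        # twice the edge count
            np.trace(B @ B @ B),              # six times the triangle count
            np.linalg.eigvalsh(B)[-1])        # largest eigenvalue

# Toy factor 1: triangle 0-1-2 plus node 3 attached to node 2; attribute on {0, 1, 2}.
A1 = np.array([[0, 1, 1, 0],
               [1, 0, 1, 0],
               [1, 1, 0, 1],
               [0, 0, 1, 0]])
f1 = np.array([1, 1, 1, 0])
# Toy factor 2: a triangle with the attribute on all three nodes.
A2 = np.ones((3, 3), dtype=int) - np.eye(3, dtype=int)
f2 = np.array([1, 1, 1])

q1, q2 = ais_quantities(A1, f1), ais_quantities(A2, f2)
q12 = ais_quantities(np.kron(A1, A2), np.kron(f1, f2))
```

Since every quantity multiplies (edge and triangle counts up to the constant factors 2 and 6), a relation such as \(m_{a}\propto n_{a}^\alpha \) holding in both factors carries over to the product with the same exponent, which is the intuition behind Lemmas 3–5.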

Theorem 6

(SOAR graphs provably obey patterns P1–P3). If \(\mathcal {G}_1 = (\mathbf {A}_{1}, \mathbf {F}_{1})\) obeys patterns P1–P3 with exponents \(\alpha , \beta \) and \(\gamma \) respectively, then \(\mathcal {G}= \textsc {SOAR} (\mathbf {A}_{1}, \mathbf {F}_{1}, M, \eta =0)\) also obeys P1–P3, with the same exponents \(\alpha , \beta \) and \(\gamma \).

Proof

We prove this using induction on the number of steps \(t=1,\ldots ,M \). It is given that \(\mathcal {G}_1\) follows P1–P3, hence the base case for \(t=1\) is true. Now suppose for \(1\le t<M \), \(\mathcal {G}_t\) follows P1–P3. Then, using Lemmas 3, 4 and 5, \(\mathcal {G}_t\ \bar{\otimes }\ \mathcal {G}_1 = \mathcal {G}_{t+1}\) follows P1–P3. Thus, by induction, \(\mathcal {G}= \mathcal {G}_{M}\) obeys P1–P3.    \(\square \)

Although Theorem 6 assumes no noise, it can be easily extended to the stochastic version of the SOAR generator to give similar guarantees in expectation. Our simulation studies, presented in Sect. 5.3, confirm our theoretical results.

Discussion. We elaborate on various aspects of the proposed SOAR model. (a) Input parsimony: SOAR, belonging to the paradigm of one-shot graph generation, has only four knobs to set: two (small) initiator matrices \((\mathbf {A}_{1}, \mathbf {F}_{1})\), number of recursive steps \(M \) and noise level \(\eta \). In contrast, evolutionary models typically need knobs for node-arrival, lifetime, sleep-time and linking processes (e.g., [29]). (b) Attribute correlations: SOAR implicitly incorporates attribute correlations, as Kronecker product naturally leads to recursive community structure [18]. Contrast this with [24] which explicitly models attribute correlations. (c) Parameter fitting: Given a social-affiliation network \(\mathcal {G}= (\mathbf {A}, \mathbf {F})\), its parameters for SOAR model can be learned by employing KronFit [18] for \(\mathbf {A}\) and \(\mathbf {F}\) separately. (d) Parameter selection: To create social-affiliation graphs with homophily, we recommend choosing initiators such that the entries of \(\mathbf {F}_{1}\mathbf {F}_{1}^T\) are correlated with those of \(\mathbf {A}_{1}\). Intuitively, this ensures that nodes with similar attributes are linked in the initiator and the self-similarity of Kronecker product passes this property on to the final graph.
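The recommendation in (d) can be checked for a candidate initiator pair before generation. The text does not say how the correlation should be measured, so this sketch makes two assumptions: it uses the Pearson correlation, and it compares only off-diagonal entries. The function name `homophily_correlation` and the example initiators are hypothetical.

```python
import numpy as np

def homophily_correlation(A1, F1):
    """Pearson correlation between off-diagonal entries of F1 F1^T and A1."""
    shared = F1 @ F1.T                              # (i, j): number of shared attributes
    off_diag = ~np.eye(A1.shape[0], dtype=bool)     # ignore self-pairs
    return np.corrcoef(shared[off_diag], A1[off_diag])[0, 1]

F1 = np.array([[1, 0],
               [1, 1],
               [0, 1]])
A_homophilous = np.array([[0, 1, 0],
                          [1, 0, 1],
                          [0, 1, 0]])   # edges exactly where attributes are shared
A_reversed = np.array([[0, 0, 1],
                       [0, 0, 0],
                       [1, 0, 0]])      # the only edge joins the pair sharing nothing
```

Here `homophily_correlation(A_homophilous, F1)` is 1 and `homophily_correlation(A_reversed, F1)` is -1; an initiator pair with a clearly positive value is the kind of coupling the recommendation describes.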

5.3 Simulation Studies

We compare SOAR to two representative baselines – AGM [24] and SAN [14] – which were the most recent works in categories [B] and [C] from Sect. 2. Quantitative experiments compare the time taken by the models to generate graphs of comparable sizes, while qualitative experiments verify whether the models are able to generate graphs obeying the three discovered patterns – [P1] Edge count vs. node count power law relation, [P2] Triangle count vs. node count power law relation, [P3] Spectral radius vs. triangle count power law relation – as well as the following well-known properties: [P4] Skewed distributions of \(\#\)friends per node (node degree), \(\#\)attributes per node (attribute degree of node) and \(\#\)nodes per attribute (AIS node count) [29], [P5] Skewed distribution of eigenvalues of adjacency matrix [7].

We use the open-sourced code for SAN as is, but adapt AGM to get a skewed distribution of \(\#\)nodes per attribute (i.e., group size [29]) and subsequently generate edges using the default Fast Chung Lu model. For SOAR, we use initiators from Fig. 6a–b (observe the correlation between \(\mathbf {F}_{1}\mathbf {F}_{1}^T\) and \(\mathbf {A}_{1}\)) replacing \(1\rightarrow 0.6, 0\rightarrow 10^{-4}\) for stochasticity (and scaling the remaining entries appropriately), recursive steps \(M =8\) and noise level \(\eta =0.5\). This yields a graph with 0.4M nodes, 5.6M edges, 65K attributes and 2M attribute memberships.

Fig. 5. Speed and scalability.

Quantitative Evaluation. Figure 5 compares generation time of SOAR vs. SAN for five different graph sizes (AGM, due to the explicit enforcing of attribute correlations, scaled poorly with #attributes). Running times are averaged over 10 runs and experiments were performed on Mac OSX Yosemite with 2.7 GHz Intel i5 core and 16 GB main memory. We find that SOAR scales linearly, i.e., slope \(\approx \)1 in log-log scale. SAN also shows the desired linear scalability, but was \(50\times \) slower for \(\sim \)1M edges plus memberships.

Qualitative Evaluation. From Figs. 6 and 7, we observe that only the proposed SOAR model is able to generate graphs obeying all these five patterns (Fig. 6), whereas the baselines fail at least one of them (Fig. 7a). In the interest of space, we show only one failed pattern per baseline: AGM leads to very low triangle count for AIS, perhaps due to its undesirably high importance to attribute correlation and homophily, which leads to few edges between nodes sharing attributes when the number of attributes is large (Fig. 7b); SAN produces an almost flat eigenvalue distribution (excluding first three values), likely due to the underlying preferential attachment model (Fig. 7c).

Fig. 6. SOAR generates realistic graphs: initiators in (a–b) lead to the discovered patterns P1–P3 (c–e) and skewed degree and eigenvalue distributions P4–P5 (f–g).

Fig. 7. (a) Graphs generated by baselines (AGM, SAN) disobey at least one pattern, e.g., (b) [P3] of AGM and (c) [P5] of SAN. Here, one marker denotes empirical adherence based on a few parameters, while the other marker indicates theoretical adherence as well. (Color figure online)

In sum, our simulations demonstrate that SOAR is able to generate social-affiliation graphs obeying all observed patterns in a fast and scalable manner.

6 Conclusion

We investigated the problem of pattern analysis and modeling of social-affiliation graphs – friendship graphs where users have many binary attributes, e.g., check-ins, page likes or group memberships – with the help of four large publicly-available real-world datasets. Our contributions are: (i) Patterns: We discovered three new consistent patterns concerning the structural properties of attribute-induced subgraphs and illustrated how the findings can be leveraged for anomaly detection. (ii) Model: We proposed the SOAR model to produce synthetic social-affiliation graphs provably matching all observed patterns. It is based on the principle of self-similarity, implicitly incorporates attribute correlations, scales linearly with graph size and is up to \(50\times \) faster than the currently available generators for social-affiliation graphs. Our code is open-sourced at www.github.com/dhivyaeswaran/soar. Similar exploration of node-attributed graphs with categorical/real-valued attributes is a valuable direction for future work.