Effective and efficient attributed community search

  • Regular Paper
  • The VLDB Journal

Abstract

Given a graph G and a vertex \(q \in G\), the community search query returns a subgraph of G that contains vertices related to q. Communities, which are prevalent in attributed graphs such as social networks and knowledge bases, can be used in emerging applications such as product advertisement and the organization of social events. In this paper, we investigate the attributed community query (or ACQ), which returns an attributed community (AC) for an attributed graph. The AC is a subgraph of G that satisfies both structure cohesiveness (i.e., its vertices are tightly connected) and keyword cohesiveness (i.e., its vertices share common keywords). The AC enables a better understanding of how and why a community is formed (e.g., members of an AC have a common interest in music, because they all have the same keyword “music”). An AC can be “personalized”; for example, an ACQ user may specify that an AC returned should be related to some specific keywords like “research” and “sports”. To enable efficient AC search, we develop the CL-tree index structure and three algorithms based on it. We further propose efficient algorithms for maintaining the index on dynamic graphs. Moreover, we study two problems that are related to the ACQ problem. We evaluate our solutions on six large graphs. Our results show that ACQ is more effective and efficient than existing community retrieval approaches. Moreover, an AC contains more precise and personalized information than the communities returned by existing community search and detection methods.

Notes

  1. URL of the SDSS project: http://www.sdss.org.

  2. In practice, the query user can be alerted by the system when there is no sharing among the vertices.

  3. We use “node” to mean “CL-tree node” in this paper.

  4. https://www.flickr.com/.

  5. http://dblp.uni-trier.de/xml/.

  6. http://www.kddcup2012.org/c/kddcup2012-track1.

  7. http://dbpedia.org/datasets.

Author information

Corresponding author

Correspondence to Yixiang Fang.

Appendices

A Proofs of lemmas

Lemma 1

(Anti-monotonicity) Given a graph G, a vertex \(q\in G\) and a set S of keywords, if there exists a subgraph \(G_k[S]\), then there exists a subgraph \(G_k[S']\) for any subset \(S'\subseteq S\).

Proof

Based on the definition of \(G_k[S]\), each vertex of \(G_k[S]\) contains S. Consider a new keyword set \(S'\subseteq S\). Each vertex of \(G_k[S]\) then contains \(S'\) as well. Also, note that \(q\in G_k[S]\). These two properties imply that there exists one subgraph of G, namely \(G_k[S]\), with core number at least k, such that it contains q and every vertex of it contains keyword set \(S'\). It follows that there exists such a subgraph with maximal size (i.e., \(G_k[S']\)). \(\square \)

Proposition 1

For any keyword set S, and vertex q, if \(G_k[S]\) exists, then \(G_k[S]\subseteq G_k[S']\) for any subset \(S'\subseteq S\).

Proof

Since \(G_k[S]\) contains vertex q and every vertex in \(G_k[S]\) contains \(S'\) (due to \(S'\subseteq S\)), \(G_k[S]\cup G_k[S']\) also contains vertex q and every vertex in it contains \(S'\). In addition, since the core numbers of \(G_k[S]\) and \(G_k[S']\) are at least k, the core number of \(G_k[S]\cup G_k[S']\) is at least k. Based on the definition of \(G_k[S']\) as the maximal such subgraph, we have \(G_k[S]\cup G_k[S']\subseteq G_k[S']\). It follows that \(G_k[S]\subseteq G_k[S']\). \(\square \)
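To make Proposition 1 concrete, the following minimal Python sketch computes \(G_k[S]\) on a toy graph and checks the containment. It assumes that \(G_k[S]\) is the largest connected subgraph containing q in which every vertex has degree at least k and carries every keyword of S, which is how the proofs above use it. The function name `gk`, the adjacency-dict representation, and the toy data are illustrative assumptions, not the authors' implementation.

```python
from collections import deque

def gk(adj, keywords, q, k, S):
    """Vertex set of G_k[S] for query vertex q, or None if it does not exist."""
    S = set(S)
    # Keyword constraint: keep only vertices whose keyword set contains S.
    alive = {v for v in adj if S <= keywords[v]}
    if q not in alive:
        return None
    # Degree constraint: peel vertices whose degree among survivors is below k.
    deg = {v: sum(1 for u in adj[v] if u in alive) for v in alive}
    queue = deque(v for v in alive if deg[v] < k)
    while queue:
        v = queue.popleft()
        if v not in alive:
            continue  # already removed via an earlier queue entry
        alive.discard(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    queue.append(u)
    if q not in alive:
        return None
    # Connectivity: keep the connected component that contains q.
    comp, frontier = {q}, deque([q])
    while frontier:
        v = frontier.popleft()
        for u in adj[v]:
            if u in alive and u not in comp:
                comp.add(u)
                frontier.append(u)
    return comp

# Toy graph (hypothetical): two triangles sharing vertex "c".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
keywords = {
    "a": {"music", "sports"}, "b": {"music", "sports"}, "c": {"music", "sports"},
    "d": {"music"}, "e": {"music"},
}

big = gk(adj, keywords, "a", 2, {"music"})              # S' = {music}
small = gk(adj, keywords, "a", 2, {"music", "sports"})  # S  = {music, sports}
print(small <= big)  # Proposition 1: G_k[S] is contained in G_k[S']
```

On this toy input, the larger keyword set yields the triangle {a, b, c}, while the smaller one yields all five vertices, so the containment of Proposition 1 holds.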

Lemma 2

Given two subgraphs \(G_k[S_1]\) and \(G_k[S_2]\) of a graph G, for a new keyword set \(S'\) generated from \(S_1\) and \(S_2\) (i.e., \(S'=S_1\cup S_2\)), if \(G_k[S']\) exists, then it must appear in a k-\(\widehat{core}\) with core number at least

$$\begin{aligned} \max \{core_G[G_k[S_1]], core_G[G_k[S_2]]\}. \end{aligned}$$
(5)

Proof

Since \(S'\) is generated from \(S_1\) and \(S_2\), we have \(S_1\subseteq S'\) and \(S_2 \subseteq S'\). Based on Proposition 1, we have \(G_k[S']\subseteq G_k[S_1]\). With such a containment relationship, it follows that \(\min \{core_G[v] \mid v\in G_k[S_1]\}\le \min \{core_G[v] \mid v\in G_k[S']\}\). Hence, the core number of \(G_k[S']\) is at least the core number of \(G_k[S_1]\). Formally, \(core_G[G_k[S_1]]\le core_G[G_k[S']]\). Similarly, \(core_G[G_k[S_2]]\le core_G[G_k[S']]\). The lemma follows directly. \(\square \)
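The containment step in this proof can be illustrated with a short Python sketch. It assumes that \(core_G[v]\) is the usual core number of v in G and that the core number of a subgraph is the minimum of \(core_G[v]\) over its vertices, as the proof uses it. The peeling routine below is a simple quadratic version of the standard procedure; the names `core_numbers` and `subgraph_core` and the toy graph are hypothetical.

```python
def core_numbers(adj):
    """Core number of every vertex, by repeatedly removing a minimum-degree vertex."""
    deg = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=deg.get)  # linear scan; adequate for a sketch
        k = max(k, deg[v])               # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return core

def subgraph_core(core, vertices):
    """Core number of a subgraph: the minimum core_G[v] over its vertices."""
    return min(core[v] for v in vertices)

# Toy graph (hypothetical): two triangles sharing "c", plus a pendant vertex "f".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("c", "e"), ("d", "e"), ("e", "f")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

core = core_numbers(adj)
larger = set(adj)            # plays the role of the containing subgraph
smaller = {"a", "b", "c"}    # plays the role of the contained subgraph
print(subgraph_core(core, larger), subgraph_core(core, smaller))  # 1 2
```

Because the minimum is taken over fewer vertices, the contained subgraph can only have an equal or larger core number, which is the inequality the proof relies on.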

Lemma 3

Given a connected graph G(V, E) with \(n=|V|\) and \(m=|E|\), if \(m - n < \frac{{{k^2} - k}}{2} - 1\), there is no k-\(\widehat{core}\) in G.

Proof

By Definition 1, a k-\(\widehat{core}\) has at least \(k+1\) vertices for any k. Since each of these vertices has degree at least k, a k-\(\widehat{core}\) has at least \(\frac{{(k + 1)k}}{2}\) edges.

Consider a connected graph that contains a k-\(\widehat{core}\) and has the minimum number of edges: the k-\(\widehat{core}\) contains only \(k+1\) vertices, and each of the remaining \(n-(k+1)\) vertices is connected to this k-\(\widehat{core}\) using exactly one additional edge. The total number of edges is

$$\begin{aligned} \frac{{(k + 1)k}}{2} + \left[ {n - (k + 1)} \right] = m \end{aligned}$$
(6)

By simple transformation, we can conclude that, if \(m - n < \frac{{{k^2} - k}}{2} - 1\), there is no k-\(\widehat{core}\) in G. \(\square \)
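The transformation can be written out step by step. Subtracting n from both sides of Eq. (6) gives

$$\begin{aligned} m - n = \frac{(k + 1)k}{2} - (k + 1) = \frac{k^2 + k - 2k - 2}{2} = \frac{k^2 - k}{2} - 1. \end{aligned}$$

Since Eq. (6) describes the connected graph with the fewest edges that still contains a k-\(\widehat{core}\), any connected graph with a smaller value of \(m - n\) cannot contain one. As a quick check with \(k=3\), the bound is \(\frac{9-3}{2}-1=2\), and the complete graph on four vertices (\(n=4\), \(m=6\)) meets it exactly with \(m-n=2\).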

Lemma 4

Given two keyword sets \(S_1\) and \(S_2\), if \(G_k[S_1]\) and \(G_k[S_2]\) exist, we have

$$\begin{aligned} G_k[S_1\cup S_2] \subseteq G_k[S_1]\cap G_k[S_2]. \end{aligned}$$
(7)

Proof

Based on Proposition 1 and \(S_1\subseteq {S_1} \cup {S_2}\), we have \({G_k}[{S_1} \cup {S_2}]\subseteq {G_k}[{S_1}]\). For the same reason, we have \({G_k}[{S_1} \cup {S_2}]\subseteq {G_k}[{S_2}]\). The lemma follows directly. \(\square \)

Lemma 6

After inserting an edge between two vertices, the maximum number of disconnected k-\(\widehat{core}\)s which need to be merged is 2.

Proof

We prove the lemma by contradiction. Consider a k-core with three disconnected k-\(\widehat{core}\)s \(G_1\), \(G_2\), and \(G_3\), with \(u\in G_1\), \(v\in G_2\), and \(w\in G_3\). Let (u, v) be the newly inserted edge that triggers the merging of \(G_1\) and \(G_2\). Suppose \(G_3\) is also affected by the insertion and needs to be merged with \(G_1\) and \(G_2\). Then there must exist a connected path of the form (w, \(\cdots \), u, \(\cdots \), v). Since (u, v) is the only inserted edge, for this path to be connected, w must already have been able to reach u or v along some path before (u, v) was inserted. That means \(G_3\) was connected to \(G_1\) or \(G_2\) before the edge insertion, and either case contradicts the assumption. Hence, the lemma holds. \(\square \)
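The connectivity argument behind Lemma 6 can be observed with a small disjoint-set (union-find) sketch in Python. This is only an illustration of plain connected components: it does not track core numbers or the CL-tree, and whether the authors use this structure for this particular step is an assumption. The class `DisjointSet`, the helper `count_components`, and the vertex names are hypothetical.

```python
class DisjointSet:
    """Minimal union-find over a fixed set of items."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, u, v):
        """Merge the components of u and v; return how many merges happened (0 or 1)."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return 0
        self.parent[ru] = rv
        return 1

def count_components(ds):
    return len({ds.find(x) for x in ds.parent})

# Three disconnected components standing in for G_1, G_2 and G_3.
ds = DisjointSet(["u", "u2", "v", "v2", "w"])
ds.union("u", "u2")   # G_1 = {u, u2}
ds.union("v", "v2")   # G_2 = {v, v2}
before = count_components(ds)

merged = ds.union("u", "v")   # insert the edge (u, v)
after = count_components(ds)
print(before, after)  # 3 2
```

A single `union` call touches only the two components of its endpoints, so the component of w is left unchanged, which mirrors the claim that at most two components are merged by one edge insertion.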

Lemma 7

In the process of merging subtrees, the maximum number of nodes which need to be merged in each level is 2.

Proof

This can be proved in the same way as Lemma 6. \(\square \)

B Basic solutions for ACQ

Algorithm 14 presents basic-g. The input of basic-g is a graph G, a query vertex q, an integer k, and a set S. It first initializes a set, \(\varPsi \), of candidate keyword sets, each consisting of a single keyword of S (line 2). Then, it finds the k-\(\widehat{core}\), \({\mathcal C}_k\), containing q from the graph G. In the loop (lines 4–11), it first initializes an empty set \(\varPhi \) (line 5) for collecting all the qualified keyword sets. Then, for each \(S'\in \varPsi \), it finds \(G_k[S']\) from \({\mathcal C}_k\) by considering the keyword and degree constraints, and puts \(S'\) into \(\varPhi \) if \(G_k[S']\) exists (lines 6–8). After checking all the candidate keyword sets in \(\varPsi \), if there is at least one qualified keyword set in \(\varPhi \), it generates a new set \(\varPsi \) of candidate keyword sets by calling geneCand(\(\varPhi \)) and continues to check larger candidate keyword sets in the next iteration; otherwise, it stops and outputs the ACs (lines 9–11).

[Algorithm 14: basic-g (pseudocode figure not reproduced)]

The other basic algorithm, basic-w, follows the same steps as basic-g, except that for each candidate keyword set \(S'\), it finds \(G_k[S']\) from G rather than from \({\mathcal C}_k\). We omit the pseudocode due to space limitations.
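The steps of basic-g described above can be sketched in Python as follows. This is a hedged reconstruction from the prose, not the authors' pseudocode: the Apriori-style join inside `gene_cand` is an assumption about what geneCand does, `gk` is a compact fixed-point version of the keyword-and-degree filtering, and all names and the toy data are hypothetical.

```python
from itertools import combinations

def gk(adj, keywords, q, k, S, within):
    """Vertex set of G_k[S] searched inside the vertex set `within`, or None."""
    alive = {v for v in within if S <= keywords[v]}
    while True:  # repeatedly drop vertices with fewer than k surviving neighbours
        low = {v for v in alive if sum(u in alive for u in adj[v]) < k}
        if not low:
            break
        alive -= low
    if q not in alive:
        return None
    comp, stack = {q}, [q]  # connected component of q among the survivors
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u in alive and u not in comp:
                comp.add(u)
                stack.append(u)
    return comp

def gene_cand(qualified):
    """Assumed Apriori-style join: build sets one keyword larger than the qualified ones."""
    qualified = set(qualified)
    size = len(next(iter(qualified))) + 1
    joined = {a | b for a in qualified for b in qualified if len(a | b) == size}
    # Keep a candidate only if all of its (size - 1)-subsets qualified (Lemma 1).
    return {c for c in joined
            if all(frozenset(s) in qualified for s in combinations(c, size - 1))}

def basic_g(adj, keywords, q, k, S):
    """Map each maximal qualified keyword set to the vertex set of its community."""
    core = gk(adj, keywords, q, k, frozenset(), within=set(adj))  # C_k containing q
    if core is None:
        return {}
    candidates = {frozenset([w]) for w in S}  # Psi: single-keyword candidates
    best = {}
    while candidates:
        found = {}                            # Phi: qualified sets of this round
        for cand in candidates:
            community = gk(adj, keywords, q, k, cand, within=core)
            if community is not None:
                found[cand] = community
        if not found:
            break
        best = found
        candidates = gene_cand(found)
    return best

# Toy graph (hypothetical): two triangles sharing vertex "c".
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
keywords = {
    "a": {"music", "sports"}, "b": {"music", "sports"}, "c": {"music", "sports"},
    "d": {"music"}, "e": {"music"},
}
result = basic_g(adj, keywords, "a", 2, {"music", "sports"})
```

Under the same assumptions, a basic-w variant would pass `within=set(adj)` in the inner call instead of `within=core`, so that each candidate is verified against the whole graph.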

C Basic algorithms for ACQ-A and ACQ-M

1. ACQ-A We show basic-g-v1 in Algorithm 15. The other algorithm, basic-w-v1, follows the same steps as basic-g-v1, except that it finds \(G_k[S]\) from G rather than from \({\mathcal C}_k\).

[Algorithm 15: basic-g-v1 (pseudocode figure not reproduced)]

2. ACQ-M We show basic-g-v2 in Algorithm 16. The other algorithm, basic-w-v2, follows the same steps as basic-g-v2, except that in line 4 it uses basic-w.

[Algorithm 16: basic-g-v2 (pseudocode figure not reproduced)]

About this article

Cite this article

Fang, Y., Cheng, R., Chen, Y. et al. Effective and efficient attributed community search. The VLDB Journal 26, 803–828 (2017). https://doi.org/10.1007/s00778-017-0482-5
