
Efficient distributed subgraph similarity matching

  • Regular Paper
  • The VLDB Journal

Abstract

Given a query graph \(q\) and a data graph \(G\), subgraph similarity matching retrieves all matches of \(q\) in \(G\) in which the number of missing edges is bounded by a given threshold \(\epsilon\). The problem has been studied extensively because it can handle applications involving noisy or erroneous graph data. In practice, a data graph can be extremely large, e.g., a web-scale graph containing hundreds of millions of vertices and billions of edges. State-of-the-art approaches process subgraph similarity queries with centralized algorithms, which are infeasible for such a large graph because of the limited computational power and storage space of a centralized server. To address this problem, we investigate subgraph similarity matching for a web-scale graph deployed in a distributed environment. We propose distributed algorithms and optimization techniques that exploit the properties of subgraph similarity matching to make good use of parallel computing power and to lower the communication cost among the distributed data centers during query processing. Specifically, we first relax and decompose \(q\) into a minimum number of sub-queries. Next, we send out each sub-query so that exact matching is conducted in parallel. Finally, we schedule and join the exact matches to obtain the final query answers. Moreover, our workload-balance strategy further speeds up query processing. Our experimental results demonstrate the feasibility of the proposed approach for subgraph similarity matching over web-scale graph data.
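To make the problem definition concrete, the following is a minimal single-machine sketch of the relax-then-match idea. It is not the distributed algorithm of the paper: it omits the decomposition into a minimum number of sub-queries, the scheduling and join of partial matches, and workload balancing. It assumes unlabeled, undirected graphs, represents a match as an injective vertex mapping, and uses hypothetical function names (`exact_matches`, `similarity_matches`) chosen only for illustration. Each relaxed query is obtained by deleting at most \(\epsilon\) edges from \(q\), and each relaxed query is matched exactly and independently, which is the property that allows the exact matching to be run in parallel.

```python
from itertools import combinations, permutations


def exact_matches(q_nodes, q_edges, g_nodes, g_edges):
    """Yield every injective mapping of query vertices to data vertices
    under which all edges in q_edges are present in the data graph.

    Brute force over vertex assignments; suitable for tiny inputs only.
    """
    g_edge_set = {frozenset(e) for e in g_edges}  # undirected edges
    for image in permutations(g_nodes, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set
               for u, v in q_edges):
            yield mapping


def similarity_matches(q_nodes, q_edges, g_nodes, g_edges, eps):
    """Return all mappings of q into G with at most eps missing edges.

    Relaxation step (illustrative): drop every subset of at most eps
    query edges, match each relaxed query exactly, and take the union.
    The relaxed queries are independent of one another.
    """
    results = set()
    for k in range(eps + 1):
        for dropped in combinations(q_edges, k):
            relaxed = [e for e in q_edges if e not in dropped]
            for mapping in exact_matches(q_nodes, relaxed, g_nodes, g_edges):
                # Store as a sorted tuple so duplicates collapse in the set.
                results.add(tuple(sorted(mapping.items())))
    return results


if __name__ == "__main__":
    # Query: a triangle. Data graph: a path 1 - 2 - 3 (no triangle).
    q_nodes = ["a", "b", "c"]
    q_edges = [("a", "b"), ("b", "c"), ("a", "c")]
    g_nodes = [1, 2, 3]
    g_edges = [(1, 2), (2, 3)]
    for eps in (0, 1):
        found = similarity_matches(q_nodes, q_edges, g_nodes, g_edges, eps)
        print(eps, len(found))
```

In this toy instance the triangle has no exact match in the path, so \(\epsilon = 0\) yields no answers, whereas \(\epsilon = 1\) admits every assignment of the three query vertices to the three data vertices, since each assignment misses exactly one edge. The enumeration is exponential in both the query size and \(\epsilon\), which is why a practical system needs the decomposition and join strategy outlined in the abstract.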

Notes

  1. We assume \(q(v)>0\) in the remainder of this paper.

Acknowledgments

This work is supported in part by the NSFC (Grant No. 61100024, 61332006, U1401256), the Fundamental Research Funds for the Central Universities (Grant No. N130504006), the National Basic Research Program of China (973, Grant No. 2011CB302200-G), the Research Grants Council of the Hong Kong SAR, China (Grant No. 14209314 and 418512), the NSFC (Grant No. 61328202), the Hong Kong RGC Project N HKUST637/13, the National Grand Fundamental Research 973 Program of China under Grant 2014CB340300, Microsoft Research Asia Gift Grant and Google Faculty Award 2013.

Author information

Corresponding author

Correspondence to Ye Yuan.

Cite this article

Yuan, Y., Wang, G., Xu, J.Y. et al. Efficient distributed subgraph similarity matching. The VLDB Journal 24, 369–394 (2015). https://doi.org/10.1007/s00778-015-0381-6

  • DOI: https://doi.org/10.1007/s00778-015-0381-6
