Abstract
Given a query graph \(q\) and a data graph \(G\), subgraph similarity matching retrieves all matches of \(q\) in \(G\) in which the number of missing edges is bounded by a given threshold \(\epsilon \). The problem has been studied extensively because it can handle applications involving noisy or erroneous graph data. In practice, a data graph can be extremely large, e.g., a web-scale graph containing hundreds of millions of vertices and billions of edges. State-of-the-art approaches process subgraph similarity queries with centralized algorithms, and are therefore infeasible for such a large graph because of the limited computational power and storage space of a centralized server. To address this problem, we investigate subgraph similarity matching for a web-scale graph deployed in a distributed environment. We propose distributed algorithms and optimization techniques that exploit the properties of subgraph similarity matching, so that query processing makes good use of parallel computing power and lowers the communication cost among the distributed data centers. Specifically, we first relax \(q\) and decompose it into a minimum number of sub-queries. Next, we dispatch the sub-queries so that exact matching is conducted for each of them in parallel. Finally, we schedule and join the exact matches to obtain the final query answers. Moreover, our workload-balance strategy further speeds up query processing. Our experimental results demonstrate the feasibility of the proposed approach for subgraph similarity matching over web-scale graph data.
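To make the problem definition concrete, the following is a minimal brute-force sketch of subgraph similarity matching on a toy graph. It is not the distributed algorithm proposed in the paper: it assumes unlabeled, undirected graphs given as edge lists, treats a match as an injective vertex mapping, and the function name `similarity_matches` and its parameters are hypothetical names chosen for illustration.

```python
from itertools import permutations


def similarity_matches(query_edges, data_edges, epsilon):
    """Return every injective mapping of query vertices to data vertices
    under which at most `epsilon` query edges are missing in the data graph.

    Assumptions (not from the paper): graphs are unlabeled and undirected,
    and vertices are exactly those that appear in some edge.
    """
    q_vertices = sorted({v for edge in query_edges for v in edge})
    g_vertices = sorted({v for edge in data_edges for v in edge})
    g_edge_set = {frozenset(edge) for edge in data_edges}

    matches = []
    # Enumerate all injective assignments of data vertices to query vertices.
    for image in permutations(g_vertices, len(q_vertices)):
        mapping = dict(zip(q_vertices, image))
        # Count query edges whose mapped endpoints are not adjacent in G.
        missing = sum(
            1
            for (u, v) in query_edges
            if frozenset((mapping[u], mapping[v])) not in g_edge_set
        )
        if missing <= epsilon:
            matches.append(mapping)
    return matches


# Toy example: a triangle query against a three-vertex path.
query = [("a", "b"), ("b", "c"), ("a", "c")]
data = [(1, 2), (2, 3)]

exact = similarity_matches(query, data, 0)    # no triangle exists in the path
relaxed = similarity_matches(query, data, 1)  # one missing edge is tolerated
```

With \(\epsilon = 0\) the path contains no triangle, so no mapping qualifies. With \(\epsilon = 1\) every placement of the triangle on the three path vertices misses exactly the edge between the two endpoints of the path, so all six mappings are returned. The exhaustive enumeration grows factorially with the query size, which is why this sketch is only suitable for illustrating the definition on small inputs.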
Notes
We assume \(q(v)>0\) in the remainder of this paper.
Acknowledgments
This work is supported in part by the NSFC (Grant No. 61100024, 61332006, U1401256), the Fundamental Research Funds for the Central Universities (Grant No. N130504006), the National Basic Research Program of China (973, Grant No. 2011CB302200-G), the Research Grants Council of the Hong Kong SAR, China (Grant No. 14209314 and 418512), the NSFC (Grant No. 61328202), the Hong Kong RGC Project N HKUST637/13, the National Grand Fundamental Research 973 Program of China under Grant 2014CB340300, Microsoft Research Asia Gift Grant and Google Faculty Award 2013.
About this article
Cite this article
Yuan, Y., Wang, G., Xu, J.Y. et al. Efficient distributed subgraph similarity matching. The VLDB Journal 24, 369–394 (2015). https://doi.org/10.1007/s00778-015-0381-6
DOI: https://doi.org/10.1007/s00778-015-0381-6