Abstract
Graph data management and mining techniques have many diverse applications in real-world scientific research and business activities. As one of the most basic operations, uniform path pattern query processing on graph data faces three major challenges. This paper addresses them as follows. Firstly, a new graph query language called G-Path is presented, which focuses on complex path pattern query processing on very large graphs. In addition, the design of a system called Para-G is proposed, which is based on a BSP-like model as well as the MapReduce model and can effectively handle distributed graph data operations and queries. Secondly, an implementation of Para-G on Hadoop, the de facto cloud platform, is described. Based on the concept of a distributed path finite state automaton, the processing of a G-Path statement in Para-G is detailed. For query optimization, several techniques are used to substantially improve the performance of G-Path query execution. Finally, extensive experiments on several graph data sets are conducted to show the usability of the G-Path query language and the effectiveness of Para-G.
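To make the idea of automaton-driven path pattern matching concrete, the following is a minimal single-machine sketch in Python. It is not G-Path syntax and not Para-G's implementation: the graph, the edge labels, the pattern `knows+ worksAt`, and the function `match_paths` are all hypothetical names chosen for illustration. Each loop iteration loosely mimics one BSP-style superstep, in which active (source, vertex, state) messages are forwarded along edges whose labels the automaton can consume.

```python
# Hypothetical edge-labeled graph: vertex -> list of (label, target).
GRAPH = {
    "alice": [("knows", "bob")],
    "bob": [("knows", "carol"), ("worksAt", "acme")],
    "carol": [("worksAt", "initech")],
    "acme": [],
    "initech": [],
}

# Hypothetical automaton for the label pattern  knows+ worksAt.
# (state, label) -> next state; state 0 is initial, state 2 is accepting.
TRANSITIONS = {
    (0, "knows"): 1,
    (1, "knows"): 1,
    (1, "worksAt"): 2,
}
START_STATE = 0
ACCEPTING = {2}


def match_paths(graph, transitions, start_state, accepting, sources):
    """Return (source, target) pairs joined by a path whose labels are accepted."""
    # A message is (source vertex, current vertex, automaton state).
    frontier = {(src, src, start_state) for src in sources}
    seen = set(frontier)
    results = set()
    while frontier:  # one iteration roughly corresponds to one superstep
        outbox = set()
        for source, vertex, state in frontier:
            for label, target in graph.get(vertex, []):
                nxt = transitions.get((state, label))
                if nxt is None:
                    continue  # the automaton cannot consume this edge label
                if nxt in accepting:
                    results.add((source, target))
                msg = (source, target, nxt)
                if msg not in seen:  # deduplicate so cyclic graphs terminate
                    seen.add(msg)
                    outbox.add(msg)
        frontier = outbox
    return results


pairs = match_paths(GRAPH, TRANSITIONS, START_STATE, ACCEPTING, ["alice"])
print(sorted(pairs))
```

In a distributed setting the frontier would be partitioned across workers and the messages exchanged between supersteps; the sketch keeps everything in one process so that only the matching logic is shown.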
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (No. 61170064, No. 61373023) and the National High Technology Research and Development Program of China (No. 2013AA013204).
Cite this article
Bai, Y., Wang, C. & Ying, X. Para-G: Path pattern query processing on large graphs. World Wide Web 20, 515–541 (2017). https://doi.org/10.1007/s11280-016-0401-5
DOI: https://doi.org/10.1007/s11280-016-0401-5