GStar: an efficient framework for answering top-k star queries on billion-node knowledge graphs

World Wide Web

Abstract

Massive knowledge graphs, such as Linked Open Data or Freebase, contain billions of labeled entities and relationships. Star queries aim to identify an entity given a set of related entities, and they are common with massive knowledge graphs. It is important to find the best way to answer star queries, and we can do this by treating it as a graph pattern-matching problem. Because knowledge graphs are noisy and incomplete in nature, we must find answers that match the star pattern closely, and extract a precise match if possible. Thus, here we propose GStar, a framework to identify the top-k best answers for a star query. GStar effectively and efficiently answers top-k star queries on billion-node graphs through a novel query model, an index-free query algorithm, and a distributed query system. We evaluate GStar through experiments on real-world knowledge graphs. Experimental results show that our query model effectively answers real-life star-pattern queries; our query algorithm can answer top-k queries in a near-real-time manner without requiring expensive graph indices; and the distributed system scales well with both the graph size and number of machines used for computation.

Notes

  1. Star queries are common on many databases. For relational databases, a star query joins a number of small (dimension) tables to a large (fact) table using a primary key to foreign key join, while for RDF (Resource Description Framework) databases, the star query has the form of a number of triple patterns with different properties sharing the same subject. In this paper, a star query has a star shape where the root node represents a queried entity that is unknown and the leaf nodes represent related entities that are already known.

  2. We recommend setting \(\alpha \) to a value smaller than \(\frac {1}{N|{V^{S}_{Q}}|}\). The setting of \(\alpha \) is discussed in Section 6.5.

  3. \({\Omega }_{+}\) is a node set containing the candidates that are visited by the propagations. We use \({\Omega }_{+}\) instead of \({\Omega }\) for the sake of the algorithm’s efficiency.

  4. http://www.informatik.uni-trier.de/

  5. http://dblp.l3s.de/dblp++.php

  6. In Figure 5, the edge between “Get Back” and “Hey Jude!” indicates the relationship of “released after”.

  7. Steven Spielberg is the executive producer of Transformers and The Lovely Bones.

  8. Here, we assume that the cap on the number of paths, N, is a fixed constant for all settings of \(\alpha \).

  9. http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
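
To make the star-query shape described in Note 1 concrete, the following is a minimal Python sketch: the root is the unknown entity, the leaves are the known related entities, and candidates are ranked so that close but incomplete matches still appear. It is an illustration only and not GStar's query model or algorithm. The function name `top_k_star`, the toy edge list `TOY_EDGES`, the treatment of edges as undirected, and the scoring rule (count of directly linked leaves) are all assumptions made for this example.

```python
from collections import defaultdict

# Toy graph (illustrative data only): each pair is one relationship,
# treated here as an undirected edge.
TOY_EDGES = [
    ("Steven Spielberg", "Transformers"),
    ("Steven Spielberg", "The Lovely Bones"),
    ("Michael Bay", "Transformers"),
    ("Peter Jackson", "The Lovely Bones"),
]


def top_k_star(edges, leaves, k):
    """Rank candidate root entities for a star query.

    edges  : iterable of (u, v) pairs describing relationships
    leaves : known related entities (the leaf nodes of the star)
    k      : number of answers to return

    Assumed scoring rule: a candidate's score is the number of query
    leaves it is directly linked to, so partial matches are kept and
    simply ranked lower than full matches.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    leaf_set = set(leaves)
    scores = defaultdict(int)
    for leaf in leaf_set:
        for candidate in neighbors[leaf]:
            if candidate not in leaf_set:  # a leaf cannot be its own root
                scores[candidate] += 1

    # Highest score first; ties are broken by name for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


print(top_k_star(TOY_EDGES, ["Transformers", "The Lovely Bones"], 2))
# [('Steven Spielberg', 2), ('Michael Bay', 1)]
```

The entity linked to both leaves ranks first, while an entity linked to only one leaf is still returned as a weaker answer. This naive version scans every neighbor of every leaf and considers direct links only, so it illustrates the query shape rather than an approach that would scale to billion-node graphs.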

References

  1. Akiba, T., Sommer, C., Kawarabayashi, K.-i.: Shortest-path queries for complex networks: Exploiting low tree-width outside the core. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp 144–155. ACM, New York (2012)

  2. Akiba, T., Iwata, Y., Yoshida, Y.: Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pp 349–360. ACM, New York (2013)

  3. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - a crystallization point for the Web of data. Web Semant. 7(3), 154–165 (2009)

  4. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. 25, 163–177 (2001)

  5. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: A recursive model for graph mining. In: Proceedings of the Fourth SIAM International Conference on Data Mining, SDM’04, pp. 442–446 (2004)

  6. Checconi, F., Petrini, F.: Traversing trillions of edges in real time: Graph exploration on large-scale parallel machines. In: Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS ’14, pp. 425–434 (2014)

  7. Cheng, J., Zeng, X., Yu, J.X.: Top-k graph pattern matching over large graphs. In: Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE ’13, pp. 1033–1044 (2013)

  8. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Ni, L., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: A Web-scale approach to probabilistic knowledge fusion. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pp. 601–610 (2014)

  9. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, pp 102–113. ACM, New York (2001)

  10. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: From intractable to polynomial time. PVLDB 3(1–2), 264–275 (2010)

  11. Han, W.-S., Lee, J., Pham, M.-D., Yu, J.X.: igraph: A framework for comparisons of disk-based graph indexing techniques. PVLDB 3(1), 449–459 (2010)

  12. He, H., Wang, H., Yang, J., Yu, P.S.: Blinks: Ranked keyword searches on graphs. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, pp 305–316. ACM, New York (2007)

  13. Ilyas, I.F., Beskales, G., Soliman, M.A.: A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40(4), 11,1–11,58 (2008)

  14. Jin, J., Khemmarat, S., Gao, L., Luo, J.: A distributed approach for top-k star queries on massive information networks. In: Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, ICPADS ’14, pp. 9–16 (2014)

  15. Jin, J., Luo, J., Khemmarat, S., Dong, F., Gao, L.: Supplementary file of gstar: An efficient framework for answering top-k star queries on billion-node knowledge graphs. http://cse.seu.edu.cn/PersonalPage/jhjin/upload/supplementary-file-wwwj.pdf (2017)

  16. Jin, J., Luo, J., Khemmarat, S., Gao, L.: Querying Web-scale knowledge graphs through effective pruning of search space. IEEE Trans. Parallel Distrib. Syst. 28(8), 2342–2356 (2017)

  17. Khan, A., Li, N., Yan, X., Guan, Z., Chakraborty, S., Tao, S.: Neighborhood based fast graph search in large networks. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD ’11, pp 901–912. ACM, New York (2011)

  18. Khan, A., Wu, Y., Aggarwal, C.C., Yan, X.: Nema: Fast graph search with label similarity. PVLDB 6(3), 181–192 (2013)

  19. Khemmarat, S., Gao, L.: Fast top-k path-based relevance query on massive graphs. In: Proceedings of the 30th IEEE International Conference on Data Engineering, ICDE ’14, pp. 316–327 (2014)

  20. Lee, J., Han, W.-S., Kasperovics, R., Lee, J.-H.: An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB 6(2), 133–144 (2012)

  21. Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed graphlab: A framework for machine learning in the cloud. PVLDB 5(8), 716–727 (2012)

  22. Malewicz, G., Austern, M.H., Bik, A.J., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: A system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp 135–146. ACM, New York (2010)

  23. Neumann, T., Weikum, G.: RDF-3X: A RISC-style engine for RDF. PVLDB 1(1), 647–659 (2008)

  24. Neumann, T., Bender, M., Michel, S., Schenkel, R., Triantafillou, P., Weikum, G.: Distributed top-k aggregation queries at large. Distrib Parallel Datab 26(1), 3–27 (2009)

  25. Power, R., Li, J.: Piccolo: Building fast, distributed programs with partitioned tables. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pp 1–14. USENIX Association, Berkeley (2010)

  26. Qiu, T., Qiao, R., Han, M., Sangaiah, A.K., Lee, I.: A lifetime-enhanced data collecting scheme for internet of things. IEEE Commun. Mag. 55(11), 132–137 (2017)

  27. Qiu, T., Zhao, A., Xia, F., Si, W., Wu, D.: ROSE: Robustness strategy for scale-free wireless sensor networks. IEEE/ACM Trans. Network. 25(5), 2944–2959 (2017)

  28. Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: An efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 1(1), 364–375 (2008)

  29. Stanton, I., Kliot, G.: Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp 1222–1230. ACM, New York (2012)

  30. Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: A core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp 697–706. ACM, New York (2007)

  31. Sun, Z., Wang, H., Wang, H., Shao, B., Li, J.: Efficient subgraph matching on billion node graphs. PVLDB 5(9), 788–799 (2012)

  32. Tian, Y., Patel, J.M.: Tale: A tool for approximate large graph matching. In: Proceedings of the 24th IEEE International Conference on Data Engineering, ICDE ’08, pp. 963–972 (2008)

  33. Tong, H., Faloutsos, C., Gallagher, B., Eliassi-Rad, T.: Fast best-effort pattern matching in large attributed graphs. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pp 737–746. ACM, New York (2007)

  34. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)

  35. Yan, D., Cheng, J., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng, W.: A general-purpose query-centric framework for querying big graphs. PVLDB 9(7), 564–575 (2016)

  36. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for Web scale RDF data. PVLDB, 265–276 (2013)

  37. Zhang, Y., Gao, Q., Gao, L., Wang, C.: Maiter: An asynchronous graph processing framework for delta-based accumulative iterative computation. IEEE Trans. Parallel Distrib. Syst. 25(8), 2091–2100 (2014)

  38. Zou, L., Chen, L., Tamer Özsu, M.: Distance-join: Pattern match query in a large graph database. PVLDB 2(1), 886–897 (2009)

  39. Zou, L., Mo, J., Chen, L., Tamer Özsu, M., Dongyan, Z.: gStore: Answering SPARQL queries via subgraph matching. PVLDB 4(8), 482–493 (2011)

Acknowledgements

This work is supported by the National Key R&D Program of China under Grant No. 2017YFB1003000; the National Natural Science Foundation of China under Grants No. 61702096, No. 61632008, No. 61320106007, No. 61572129, No. 61602112, No. 61502097, No. 61370207 and No. 61702097; the International S&T Cooperation Program of China under Grant No. 2015DFA10490; the Natural Science Foundation of Jiangsu Province under Grants BK20170689 and BK20160695; the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201; the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9; and the Fundamental Research Funds for the Central Universities. It is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization and the Collaborative Innovation Center of Wireless Communications Technology, and also by U.S. NSF grants CNS-1217284 and CCF-1018114. Jiahui Jin was a visiting student at UMass Amherst, supported by the China Scholarship Council, when this work was performed. Any opinions, findings, conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the sponsors. A preliminary version [14] of this paper appeared in the Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS’14).

Author information

Corresponding author

Correspondence to Jiahui Jin.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jin, J., Luo, J., Khemmarat, S. et al. GStar: an efficient framework for answering top-k star queries on billion-node knowledge graphs. World Wide Web 22, 1611–1638 (2019). https://doi.org/10.1007/s11280-018-0611-0

  • DOI: https://doi.org/10.1007/s11280-018-0611-0
