Abstract
Effective storage management is a basic prerequisite for making full use of knowledge graphs. Because performance evaluations of knowledge graph stores are lacking, it is difficult for users to decide which store best suits their needs. Yet no existing study of performance prediction focuses on storage structures. To fill this gap, we propose PreKar, a learned performance predictor that estimates the time cost of processing given workloads on candidate stores. Training such a model well is challenging because historical workloads have low diversity and the embedding strategy must be lightweight. To address these problems, we first develop a novel candidate store generator, which not only discovers all possible candidate stores for model training but also multiplies the number of training instances. Based on the generated stores, we derive an effective and lightweight encoder that embeds the main features of workloads and stores into the model while keeping PreKar efficient. Experimental results on real knowledge graphs demonstrate that PreKar achieves high accuracy in performance prediction and saves a large amount of time in obtaining performance figures for knowledge graph stores.



Acknowledgements
This work was supported by the National Natural Science Foundation of China (grant U1866602).
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Financial interests
This study was funded by the National Natural Science Foundation of China (grant number U1866602).
Non-financial interests
None.
Additional information
This article belongs to the Topical Collection: Special Issue on Decision Making in Heterogeneous Network Data Scenarios and Applications
Guest Editors: Jianxin Li, Chengfei Liu, Ziyu Guan, and Yinghui Wu.
Cite this article
Qi, Z., Wang, H., Shen, Z. et al. PreKar: A learned performance predictor for knowledge graph stores. World Wide Web 26, 321–341 (2023). https://doi.org/10.1007/s11280-022-01033-2
DOI: https://doi.org/10.1007/s11280-022-01033-2