Abstract
Leveraging relational Big Data (BD) processing frameworks to process large-scale RDF graphs has generated great interest in optimizing query performance. Modern BD systems are, however, complex data systems whose configurations notably affect performance. Benchmarking different frameworks and configurations provides the community with best practices for better performance. However, most of these benchmarking efforts are limited to descriptive and diagnostic analytics. Moreover, there is no standard for comparing these benchmarks based on quantitative ranking techniques. In this paper, we discuss how our work fills this timely research gap. In particular, we investigate how to enable prescriptive analytics via ranking functions (called “BenchRank”). We present a research plan that builds on state-of-the-art benchmarking efforts in the area of querying large RDF graphs. Finally, we present the research results of the proposed plan.
M. Ragab—Supervised by Riccardo Tommasini, LIRIS Lab, INSA Lyon, France.
Notes
1. The relational schema impacts query joins, partitioning techniques impact data shuffling, and storage formats impact physical execution plans.
2. We omit details about schema options (ST, VP, PT) and partitioning options (HP, SBP, PBP) due to space limits; they can be found on the project’s GitHub page: https://datasystemsgrouput.github.io/SPARKSQLRDFBenchmarking/.
3. Each configuration C is ranked according to its running time on the queries.
4. Kendall’s index is a common measure for comparing the orderings produced by ranking functions.
5. Conformance and coherence results [7] are omitted due to space limits.
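The ideas in notes 3 and 4 can be sketched as follows. This is a minimal illustration, not the BenchRank implementation: it assumes that "Kendall's index" refers to the Kendall tau rank correlation (tau-a, with no ties), that each query yields one running time per configuration, and that there are at least two configurations. The configuration labels and running times are hypothetical values invented for the example.

```python
from itertools import combinations


def rank_by_runtime(runtimes):
    """Rank configurations for one query: rank 1 is the shortest running time."""
    ordered = sorted(runtimes, key=runtimes.get)
    return {config: position for position, config in enumerate(ordered, start=1)}


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a between two rankings of the same items (assumes no ties)."""
    items = list(rank_a)
    concordant = 0
    discordant = 0
    for x, y in combinations(items, 2):
        # A pair is concordant when both rankings order x and y the same way.
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(items) * (len(items) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical running times in seconds for two queries (made-up numbers).
q1_runtimes = {"ST-HP": 12.0, "VP-SBP": 8.5, "PT-PBP": 20.1, "VP-HP": 15.3}
q2_runtimes = {"ST-HP": 9.0, "VP-SBP": 11.2, "PT-PBP": 30.4, "VP-HP": 14.8}

ranks_q1 = rank_by_runtime(q1_runtimes)
ranks_q2 = rank_by_runtime(q2_runtimes)
tau = kendall_tau(ranks_q1, ranks_q2)

print(ranks_q1)
print(ranks_q2)
print(round(tau, 3))
```

A value of tau close to 1 means the two rankings order the configurations almost identically, while a value close to -1 means they are nearly reversed. Here the two queries disagree only on the order of the two fastest configurations.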
References
Abdelaziz, I., Harbi, R., Khayyat, Z., Kalnis, P.: A survey and experimental comparison of distributed SPARQL engines for very large RDF data. VLDB 10(13), 2049–2060 (2017)
Akhter, A., Ngonga Ngomo, A.-C., Saleem, M.: An empirical evaluation of RDF graph partitioning techniques. In: Faron Zucker, C., Ghidini, C., Napoli, A., Toussaint, Y. (eds.) EKAW 2018. LNCS (LNAI), vol. 11313, pp. 3–18. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-03667-6_1
Arrascue Ayala, V.A.: Relational schemata for distributed SPARQL query processing. In: SBD (2019)
Deb, K., Pratap, A., Agarwal, S.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
Ivanov, T., Pergolesi, M.: The impact of columnar file formats on SQL-on-hadoop engine performance: a study on ORC and parquet. Concurr. Comput. Pract. Exp. 32(5), e5523 (2019)
Moaawad, M.R., Mokhtar, H.M.O., Al Feel, H.T.: On-the-fly academic linked data integration. In: Proceedings of the International Conference on Compute and Data Analysis, pp. 114–122 (2017)
Ragab, M., Awaysheh, F.M., Tommasini, R.: Bench-ranking: a first step towards prescriptive performance analyses for big data frameworks. In: IEEE Conference on Big Data (2021)
Ragab, M., Tommasini, R., et al.: An in-depth investigation of large-scale RDF relational schema optimizations using Spark-SQL. In: DOLAP@EDBT/ICDT (2021)
Ragab, M., Tommasini, R., Eyvazov, S., Sakr, S.: Towards making sense of Spark-SQL performance for processing vast distributed RDF datasets. In: SBD (2020)
Ragab, M., Tommasini, R., Sakr, S.: Benchmarking Spark-SQL under alliterative RDF relational storage backends. In: QuWeDa@ ISWC, pp. 67–82 (2019)
Ragab, M., Tommasini, R., Sakr, S.: Comparing schema advancements for distributed RDF querying using SparkSQL. In: ISWC 2020 Demos and Industry Tracks (2020)
Sakr, S., Bonifati, A., Voigt, H., et al.: The future is big graphs: a community view on graph processing systems. CACM 64(9), 62–71 (2021)
Schätzle, A., Przyjaciel-Zablocki, M., Neu, A., Lausen, G.: Sempala: interactive SPARQL query processing on hadoop. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 164–179. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_11
Schätzle, A., Przyjaciel-Zablocki, M., Skilevic, S., Lausen, G.: S2RDF: RDF querying with SPARQL on spark. VLDB 9(10), 804–815 (2016)
Tommasini, R., Ragab, M., Falcetta, A., Valle, E.D., Sakr, S.: A first step towards a streaming linked data life-cycle. In: Pan, J.Z., et al. (eds.) ISWC 2020. LNCS, vol. 12507, pp. 634–650. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62466-8_39
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Ragab, M. (2022). Towards Prescriptive Analyses of Querying Large Knowledge Graphs. In: Chiusano, S., et al. New Trends in Database and Information Systems. ADBIS 2022. Communications in Computer and Information Science, vol 1652. Springer, Cham. https://doi.org/10.1007/978-3-031-15743-1_59
DOI: https://doi.org/10.1007/978-3-031-15743-1_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-15742-4
Online ISBN: 978-3-031-15743-1
eBook Packages: Computer Science (R0)