Abstract
The rise of diversified data engines has created the need for federated queries. A federated query system takes a single query and answers it by analyzing data held in various data engines. Because the queried data originates from multiple data engines, federated queries usually rely on join operations and data migration to complete, and they often take a long time. The challenges of optimizing federated queries lie in join order selection and data migration coordination. However, enumerating all join orders is impractical because the number of join orders grows exponentially with the number of relations to be joined. To improve the performance of federated queries, we present a deep reinforcement learning-based approach for optimizing join order and join engine selection and design a deep Q-network-based (DQN-based) optimizer. The DQN-based optimizer generates join search policies that optimize join order selection for datasets under a given cost model. Based on the DQN-based optimizer, we implement Coral, a federated query system that optimizes join order selection for federated queries. With the optimized join order, Coral transforms a federated query into a set of subqueries that are assigned to and executed on different data engines. We also propose a subquery cache optimization to reduce data migration during query execution. Extensive experimental evaluation demonstrates that Coral significantly reduces the latency of federated queries and achieves a speedup of up to 5.03\(\times\) compared with state-of-the-art federated query systems.
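To give a sense of why exhaustive enumeration is impractical, the following minimal sketch counts candidate join plans for n relations. It uses the standard counts of n! left-deep orders and (2n-2)!/(n-1)! ordered bushy trees, and it assumes cross products are allowed. The function names and the engine-placement count (every join may run on any of k engines) are illustrative assumptions, not a description of Coral's actual search space.

```python
from math import factorial


def left_deep_orders(n: int) -> int:
    """Left-deep join orders over n relations: n!."""
    return factorial(n)


def bushy_orders(n: int) -> int:
    """Ordered binary (bushy) join trees over n relations: (2n-2)!/(n-1)!."""
    return factorial(2 * n - 2) // factorial(n - 1)


def bushy_orders_with_engines(n: int, k: int) -> int:
    """Hypothetical count: each of the n-1 joins may be placed on any of k engines."""
    return bushy_orders(n) * k ** (n - 1)


if __name__ == "__main__":
    # Print the growth of the plan space for a few query sizes.
    for n in (2, 4, 8, 12):
        print(n, left_deep_orders(n), bushy_orders(n), bushy_orders_with_engines(n, 3))
```

Even under these simplified assumptions the counts grow far faster than any enumeration budget allows, which is the motivation for learning a join search policy instead of enumerating plans.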















Notes
Coral is open sourced at https://github.com/PasaLab/Coral
Acknowledgements
We thank the anonymous reviewers for their comments. This work is funded in part by the China National Science Foundation (Grant No. 62072230), the Open Project of State Key Laboratory for Novel Software Technology (Grant No. KFKT2021B33), the Jiangsu Province Science and Technology Key Program (Grant No. BE2021729), the Fundamental Research Funds for the Central Universities (Nos. 020214380089 and 020214380098), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Ethics declarations
Conflicts of interest
The authors have no relevant financial or non-financial interests to disclose.
About this article
Cite this article
Gu, R., Zhang, Y., Yin, L. et al. Coral: federated query join order optimization based on deep reinforcement learning. World Wide Web 26, 3093–3118 (2023). https://doi.org/10.1007/s11280-023-01156-0
DOI: https://doi.org/10.1007/s11280-023-01156-0