SQL Query Optimization in Distributed NoSQL Databases for Cloud-Based Applications

Karras, Aristeidis; Karras, Christos; Pervanas, Antonios; Sioutas, Spyros; Zaroliagis, Christos

doi:10.1007/978-3-031-33437-5_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13799)

Included in the following conference series:

International Symposium on Algorithmic Aspects of Cloud Computing

Abstract

We present a method for query optimization based on Spark SQL, the Apache Spark module that integrates relational data processing. The goal of this paper is to explore NoSQL databases and their effective use in distributed environments to reduce query execution time, so as to meet users' complex demands in cloud computing settings that require real-time generation of dynamic pages and delivery of dynamic information.

In this work, we investigate query optimization across various query execution paths that combine MongoDB and Spark SQL, aiming to reduce the average query execution time. We do so through a sequence of query execution path scenarios that split the initial query into sub-queries between MongoDB and Spark SQL, together with a mediator between Apache Spark and MongoDB. This mediator transfers either the entire database from MongoDB to Spark, or only a subset of the results for those sub-queries executed in MongoDB. Our experimental results with eight different query execution path scenarios and six different database sizes demonstrate the clear superiority and scalability of a specific scenario.
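
The two mediator behaviours described above can be illustrated with a minimal sketch. It is not the authors' implementation: plain Python lists stand in for both MongoDB and Spark SQL, and every name in it (`COLLECTION`, `mediator_full_transfer`, `mediator_pushdown`, `engine_total`, `is_patras`) is hypothetical. It only shows that running the filtering sub-query at the store leaves the result unchanged while shrinking the data that has to be shipped.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]

# Hypothetical in-memory stand-in for a MongoDB collection (assumption:
# the real system would hold these documents in MongoDB).
COLLECTION: List[Doc] = [
    {"_id": i, "city": "Patras" if i % 4 == 0 else "Athens", "amount": i * 10}
    for i in range(1, 21)
]


def mediator_full_transfer(collection: List[Doc]) -> List[Doc]:
    """Path A: ship every document to the processing engine."""
    return list(collection)


def mediator_pushdown(collection: List[Doc],
                      predicate: Callable[[Doc], bool]) -> List[Doc]:
    """Path B: run the filtering sub-query at the store, ship only matches."""
    return [doc for doc in collection if predicate(doc)]


def engine_total(docs: List[Doc], predicate: Callable[[Doc], bool]) -> int:
    """Engine-side sub-query: filter (a no-op if already filtered) and sum."""
    return sum(doc["amount"] for doc in docs if predicate(doc))


def is_patras(doc: Doc) -> bool:
    """Example selection predicate shared by both execution paths."""
    return doc["city"] == "Patras"


# Path A: the whole collection crosses the mediator, the engine filters.
shipped_a = mediator_full_transfer(COLLECTION)
total_a = engine_total(shipped_a, is_patras)

# Path B: the store filters first, only matching documents cross the mediator.
shipped_b = mediator_pushdown(COLLECTION, is_patras)
total_b = engine_total(shipped_b, is_patras)

print(f"full transfer: {len(shipped_a)} documents shipped, total={total_a}")
print(f"pushdown:      {len(shipped_b)} documents shipped, total={total_b}")
```

Both paths compute the same aggregate; they differ only in how many documents cross the mediator, which is the kind of trade-off the execution path scenarios compare. Which scenario performs best in practice depends on the measurements reported in the paper, not on this toy model.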

Notes

1. Available at: https://www.mongodb.com/products/compass.
2. Available at: https://www.mongodb.com/docs/spark-connector/current/.
3. The standard range notation is used where “[” and “]” denote inclusive bounds and “(” and “)” denote exclusive bounds.
4. Hotspots in sharded clusters refer to situations where a specific chunk of data receives a disproportionate amount of read and write operations, causing performance issues.

Author information

Authors and Affiliations

Computer Engineering and Informatics Department, University of Patras, 26504, Patras, Greece
Aristeidis Karras, Christos Karras, Antonios Pervanas, Spyros Sioutas & Christos Zaroliagis
Computer Technology Institute and Press “Diophantus”, Patras University Campus, 26504, Patras, Greece
Christos Zaroliagis

Corresponding author

Correspondence to Christos Zaroliagis.

Editor information

Editors and Affiliations

University of Bologna, Bologna, Italy
Luca Foschini
University of Patras, Rio, Greece
Spyros Kontogiannis

About this paper

Cite this paper

Karras, A., Karras, C., Pervanas, A., Sioutas, S., Zaroliagis, C. (2023). SQL Query Optimization in Distributed NoSQL Databases for Cloud-Based Applications. In: Foschini, L., Kontogiannis, S. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2022. Lecture Notes in Computer Science, vol 13799. Springer, Cham. https://doi.org/10.1007/978-3-031-33437-5_2

DOI: https://doi.org/10.1007/978-3-031-33437-5_2
Published: 26 May 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-33436-8
Online ISBN: 978-3-031-33437-5
eBook Packages: Computer Science, Computer Science (R0)
