SQL Query Optimization in Distributed NoSQL Databases for Cloud-Based Applications

  • Conference paper
Algorithmic Aspects of Cloud Computing (ALGOCLOUD 2022)

Abstract

We present a query optimization method that uses Spark SQL, the Apache Spark module that integrates relational data processing. The goal of this paper is to explore NoSQL databases and their effective use in distributed environments to reduce query execution time, so as to meet the complex user demands of cloud computing settings that require the real-time generation of dynamic pages and the provision of dynamic information.

In this work, we investigate query optimization over various query execution paths that combine MongoDB and Spark SQL, aiming to reduce the average query execution time. We do so through a sequence of query execution path scenarios that split the initial query into sub-queries between MongoDB and Spark SQL, together with a mediator between Apache Spark and MongoDB. The mediator transfers either the entire database from MongoDB to Spark or only a subset of the results of the sub-queries executed in MongoDB. Our experimental results with eight query execution path scenarios and six different database sizes demonstrate the clear superiority and scalability of one specific scenario.
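The two mediator modes described above can be illustrated with a minimal, self-contained sketch. It is not the implementation evaluated in the paper and makes no MongoDB or Spark calls: a plain Python list stands in for the MongoDB collection, and the names `ORDERS`, `store_find`, `engine_total_by_customer`, `run_full_transfer`, and `run_pushdown` are hypothetical, as are the sample data and the query itself. It only shows that executing the filtering sub-query on the store side returns the same answer while moving fewer documents to the processing engine.

```python
from collections import defaultdict

# Hypothetical stand-in for a MongoDB collection: a list of documents.
ORDERS = [
    {"customer": "a", "region": "EU", "amount": 10},
    {"customer": "b", "region": "US", "amount": 25},
    {"customer": "a", "region": "EU", "amount": 5},
    {"customer": "c", "region": "EU", "amount": 40},
    {"customer": "b", "region": "US", "amount": 15},
]


def store_find(collection, predicate=None):
    """Sub-query run on the store side; returns the documents to transfer."""
    if predicate is None:
        return list(collection)
    return [doc for doc in collection if predicate(doc)]


def engine_total_by_customer(docs, predicate=None):
    """Sub-query run on the engine side: optional filter, then group and sum."""
    totals = defaultdict(int)
    for doc in docs:
        if predicate is None or predicate(doc):
            totals[doc["customer"]] += doc["amount"]
    return dict(totals)


def is_eu(doc):
    """Selection predicate of the illustrative query."""
    return doc["region"] == "EU"


def run_full_transfer(collection):
    """Mode 1: transfer the whole collection, filter and aggregate in the engine."""
    transferred = store_find(collection)
    return engine_total_by_customer(transferred, is_eu), len(transferred)


def run_pushdown(collection):
    """Mode 2: filter in the store, transfer only the matching subset."""
    transferred = store_find(collection, is_eu)
    return engine_total_by_customer(transferred), len(transferred)


full_result, full_moved = run_full_transfer(ORDERS)
push_result, push_moved = run_pushdown(ORDERS)
print(full_result == push_result, full_moved, push_moved)
```

In this toy setting both modes compute the same per-customer totals; they differ only in how many documents cross the boundary between the store and the engine, which is the quantity a real mediator would try to keep small.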

Notes

  1. Available at: https://www.mongodb.com/products/compass.

  2. Available at: https://www.mongodb.com/docs/spark-connector/current/.

  3. The standard range notation is used where “[” and “]” denote inclusive bounds and “(” and “)” denote exclusive bounds.

  4. Hotspots in sharded clusters refer to situations where a specific chunk of data receives a disproportionate amount of read and write operations, causing performance issues.

Corresponding author

Correspondence to Christos Zaroliagis.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Karras, A., Karras, C., Pervanas, A., Sioutas, S., Zaroliagis, C. (2023). SQL Query Optimization in Distributed NoSQL Databases for Cloud-Based Applications. In: Foschini, L., Kontogiannis, S. (eds) Algorithmic Aspects of Cloud Computing. ALGOCLOUD 2022. Lecture Notes in Computer Science, vol 13799. Springer, Cham. https://doi.org/10.1007/978-3-031-33437-5_2

  • DOI: https://doi.org/10.1007/978-3-031-33437-5_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-33436-8

  • Online ISBN: 978-3-031-33437-5

  • eBook Packages: Computer Science, Computer Science (R0)
