
iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Journal of Grid Computing

Abstract

Query optimization for Big Data has become a promising research direction due to the popularity of massive data analytical systems such as Hadoop. Executing JOIN queries efficiently in Hive, the Hadoop query language, is increasingly difficult when Big Data storage is limited. In our previous work, the HiveQL Optimization for JOIN query over Multi-session Environment (HOME) system was introduced on top of Hadoop to improve performance by storing intermediate results and thereby avoiding repeated computations. The main drawbacks of the HOME system are its time overhead and its Big Data storage requirements, especially when additional physical storage must be used or extra virtualized storage must be rented. In this paper, an index-based system for reusing data, called indexing HiveQL Optimization for JOIN over Multi-session Big Data Environment (iHOME), is proposed to overcome the HOME overheads by storing only the indexes of the joined rows instead of the full intermediate results. Moreover, the proposed iHOME system addresses eight cases of JOIN queries, which are classified into three groups: Similar-to-iHOME, Compute-on-iHOME, and Filter-of-iHOME. Experimental results on the TPC-H benchmark show that iHOME reduces the execution time of the eight JOIN queries on Hive. The stored data size in iHOME is also reduced relative to the HOME system, which saves Big Data storage. Consequently, as the stored data size increases, the iHOME system guarantees space scalability and overcomes the storage limitation.
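The core idea of storing row indexes rather than full joined rows can be illustrated with a small in-memory sketch. This is not the iHOME implementation, which runs on Hive and Hadoop; the function names `join_indexes` and `materialize` and the sample tables are hypothetical and exist only to show why an index of matching row positions is smaller than the materialized join yet sufficient to rebuild it.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def join_indexes(left: List[Row], right: List[Row], key: str) -> List[Tuple[int, int]]:
    """Equi-join two tables but keep only (left_pos, right_pos) pairs.

    Hypothetical helper: the stored result is a list of integer pairs,
    not the joined rows themselves.
    """
    # Build a hash table from join-key value to row positions in `right`.
    positions: Dict[object, List[int]] = {}
    for j, row in enumerate(right):
        positions.setdefault(row[key], []).append(j)
    # Probe with each left row and record matching position pairs.
    return [(i, j) for i, row in enumerate(left) for j in positions.get(row[key], [])]


def materialize(pairs: List[Tuple[int, int]], left: List[Row], right: List[Row]) -> List[Row]:
    """Rebuild the full joined rows from stored position pairs."""
    return [{**left[i], **right[j]} for i, j in pairs]


# Illustrative data only (not taken from TPC-H).
customers = [{"custkey": 1, "name": "A"}, {"custkey": 2, "name": "B"}]
orders = [
    {"orderkey": 10, "custkey": 1},
    {"orderkey": 11, "custkey": 2},
    {"orderkey": 12, "custkey": 1},
]

pairs = join_indexes(orders, customers, "custkey")
joined = materialize(pairs, orders, customers)
```

In this sketch, a later query over the same join can reuse `pairs` and materialize only the rows it needs, instead of recomputing the join or keeping every joined row in storage.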


Author information

Corresponding author

Correspondence to Radhya Sahal.


Cite this article

Sahal, R., Nihad, M., Khafagy, M.H. et al. iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage. J Grid Computing 16, 345–380 (2018). https://doi.org/10.1007/s10723-018-9431-9


  • DOI: https://doi.org/10.1007/s10723-018-9431-9
