Abstract
Query optimization for Big Data has become a promising research direction owing to the popularity of massive data analytics systems such as Hadoop. Executing JOIN queries efficiently in Hive, the query layer on top of Hadoop, is increasingly difficult when Big Data storage is limited. In our previous work, the HiveQL Optimization for JOIN query over Multi-session Environment (HOME) system was introduced on top of Hadoop to improve performance by storing intermediate results and thereby avoiding repeated computations. The main drawbacks of HOME are its time overhead and its Big Data storage requirements, especially when additional physical storage must be used or extra virtualized storage must be rented. In this paper, an index-based data-reuse system called indexing HiveQL Optimization for JOIN over Multi-session Big Data Environment (iHOME) is proposed to overcome the overheads of HOME by storing only the indexes of the joined rows instead of the full intermediate results. Moreover, iHOME addresses eight cases of JOIN queries, which are classified into three groups: Similar-to-iHOME, Compute-on-iHOME, and Filter-of-iHOME. Experimental results on the TPC-H benchmark show that iHOME reduces the execution time of the eight JOIN queries on Hive. The size of the stored data is also smaller in iHOME than in HOME, which saves Big Data storage. Consequently, as the stored data size grows, iHOME guarantees space scalability and overcomes the storage limitation.
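The difference between storing full intermediate results and storing only the indexes of the joined rows can be illustrated with a minimal Python sketch. It is not the authors' implementation: in-memory lists stand in for Hive tables, and the function names (`join_full`, `join_index`, `materialize`) and the sample tables are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def join_full(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Store the full joined rows (the kind of result HOME is described as keeping)."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def join_index(left: List[Row], right: List[Row], key: str) -> List[Tuple[int, int]]:
    """Store only (left position, right position) pairs of the rows that join."""
    # Hash the right table on the join key so each left row is probed once.
    positions: Dict[object, List[int]] = {}
    for j, r in enumerate(right):
        positions.setdefault(r[key], []).append(j)
    return [(i, j) for i, l in enumerate(left) for j in positions.get(l[key], [])]


def materialize(index: List[Tuple[int, int]], left: List[Row], right: List[Row]) -> List[Row]:
    """Rebuild the joined rows from the stored index without re-running the join."""
    return [{**left[i], **right[j]} for i, j in index]


# Hypothetical sample tables, loosely modeled on a customer/orders schema.
customers = [{"custkey": 1, "name": "A"}, {"custkey": 2, "name": "B"}]
orders = [
    {"orderkey": 10, "custkey": 1, "total": 50.0},
    {"orderkey": 11, "custkey": 2, "total": 75.0},
    {"orderkey": 12, "custkey": 1, "total": 20.0},
]

stored_index = join_index(customers, orders, "custkey")
reused_rows = materialize(stored_index, customers, orders)
```

In this sketch the stored index holds two integers per joined row regardless of how wide the rows are, which is the intuition behind the storage saving claimed for iHOME. A later query over the same join could rebuild the rows from `stored_index` and then filter or aggregate them.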
Cite this article
Sahal, R., Nihad, M., Khafagy, M.H. et al. iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage. J Grid Computing 16, 345–380 (2018). https://doi.org/10.1007/s10723-018-9431-9
DOI: https://doi.org/10.1007/s10723-018-9431-9