iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Sahal, Radhya; Nihad, Marwah; Khafagy, Mohamed H.; Omara, Fatma A.

doi:10.1007/s10723-018-9431-9

iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Published: 06 March 2018

Volume 16, pages 345–380, (2018)
Cite this article

Journal of Grid Computing Aims and scope Submit manuscript

Radhya Sahal¹,
Marwah Nihad²,
Mohamed H. Khafagy³ &
…
Fatma A. Omara¹

230 Accesses
13 Citations
Explore all metrics

Abstract

Query optimization in Big Data becomes a promising research direction due to the popularity of massive data analytical systems such as Hadoop system. The query optimization is getting hard to efficiently execute JOIN queries on top of Hadoop query language, Hive, over limited Big Data storages. According to our previous work, HiveQL Optimization for JOIN query over Multi-session Environment (HOME) system has been introduced over Hadoop system to improve its performance by storing the intermediate results to avoid repeated computations. Time overheads and Big Data storages limitation are considered the main drawback of the HOME system, especially in the case of using additional physical storages or renting extra virtualized storages. In this paper, an index-based system for reusing data called indexing HiveQL Optimization for JOIN over Multi-session Big Data Environment (iHOME) is proposed to overcome HOME overheads by storing only the indexes of the joined rows instead of storing the full intermediate results directly. Moreover, the proposed iHOME system addresses eight cases of JOIN queries which classified into three groups; Similar-to-iHOME, Compute-on-iHOME, and Filter-of-iHOME. According to the experimental results of the iHOME system using TPC-H benchmark, it is found that the execution time of eight JOIN queries using iHOME on Hive has been reduced. Also, the stored data size in the iHOME system is reduced relative to the HOME system, as well as, the Big Data storage is saved. So, by increasing stored data size, the iHOME system guarantees the space scalability and overcomes the storage limitation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Efficient query processing framework for big data warehouse: an almost join-free approach

Article 26 January 2015

Huiju Wang, Xiongpai Qin, … Shan Wang

GPU-based efficient join algorithms on Hadoop

Article 03 April 2020

Hongzhi Wang, Ning Li, … Jianing Li

Improved join operations using ORC in HIVE

Article 08 December 2016

Ruchika Kumar & Naresh Kumar

References

Akerkar, R: Big Data Computing. CRC Press, Boca Raton (2013)
Book Google Scholar
Gkoulalas-Divanis, A., Labbi, A.: Large-Scale Data Analytics. Springer, Berlin (2014)
Book Google Scholar
Chen, C.P., Zhang, C.-Y.: Data-intensive applications, challenges, techniques and technologies: a survey on Big Data. Inf. Sci. 275, 314–347 (2014)
Article Google Scholar
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
Article Google Scholar
Bajaber, F., Elshawi, R., Batarfi, O., Altalhi, A., Barnawi, A., Sakr, S.: Big data 2.0 processing systems: taxonomy and open challenges. J. Grid Comput. 14, 379–405 (2016)
Article Google Scholar
Khezr, S.N., Navimipour, N.J.: MapReduce and its applications, challenges, and architecture: a comprehensive review and directions for future research. J. Grid Comput. 15, 295–321 (2017)
Article Google Scholar
Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. 2, 1626–1629 (2009)
Article Google Scholar
Abdullah, M.N., Khafagy, M.H., Omara, F.A.: HOME: HiveQL optimization in multi-session environment. In: Proceedings of the 5th European Conference of Computer Science (ECCS14), pp. 80–89 (2014)
Elghandour, I., Aboulnaga, A.: Restore: reusing results of mapreduce jobs. Proc. VLDB Endow. 5, 586–597 (2012)
Article Google Scholar
Gruenheid, A., Omiecinski, E., Mark, L.: Query optimization using column statistics in hive. In: Proceedings of the 15th Symposium on International Database Engineering & Applications, pp. 97–105 (2011)
Zhang, Y., Gao, Q., Gao, L., Wang, C.: iMapReduce: a distributed computing framework for iterative computation. J. Grid Comput. 10, 47–68 (2012)
Article Google Scholar
Olston, C., Reed, B., Silberstein, A., Srivastava, U.: Automatic optimization of parallel dataflow programs. In: USENIX 2008 Annual Technical Conference on Annual Technical Conference, pp. 267–273 (2008)
Dinda, P., Lu, D.: Fast compositional queries in a relational grid information service. J. Grid Comput. 3, 131 (2005)
Article Google Scholar
Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: MRShare: sharing across multiple queries in MapReduce. Proc. VLDB Endow. 3, 494–505 (2010)
Article MATH Google Scholar
Wang, G., Chan, C.-Y.: Multi-query optimization in mapreduce framework. Proc. VLDB Endow. 7, 145–156 (2013)
Article Google Scholar
Sahal, R., Khafagy, M.H., Omara, F.A.: Comparative study of multi-query optimization techniques using shared predicate-based for Big Data. Int. J. Grid Distrib. Comput. 9, 229–240 (2016)
Article Google Scholar
LeFevre, J., Sankaranarayanan, J., Hacigumus, H., Tatemura, J., Polyzotis, N., Carey, M.J.: Opportunistic physical design for big data analytics. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 851–862 (2014)
LeFevre, J., Sankaranarayanan, J., Hacigumus, H., Tatemura, J., Polyzotis, N., Carey, M.J.: MISO: souping up big data query processing with a multistore system. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1591–1602 (2014)
Dokeroglu, T., Ozal, S., Bayir, M.A., Cinar, M.S., Cosar, A.: Improving the performance of Hadoop Hive by sharing scan and computation tasks. J. Cloud Comput. 3, 1–11 (2014)
Article Google Scholar
Camacho-Rodríguez, J., Colazzo, D., Herschel, M., Manolescu, I., Chowdhury, S.R.: Reuse-based Optimization for Pig Latin. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 2215–2220 (2016)
Van Hieu, D., Smanchat, S., Meesad, P.: MapReduce join strategies for key-value storage. In: 2014 11th International Joint Conference on Computer Science and Software Engineering (JCSSE), pp. 164–169 (2014)
Wu, S., Li, F., Mehrotra, S., Ooi, B.C.: Query optimization for massively parallel data processing. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, p. 12 (2011)
Yue, M., Gao, H., Shi, S., Wang, H.: Join query processing in data quality management. In: Database Systems for Advanced Applications, pp. 329–342. Springer International Publishing, Cham (2016)
Azez, H.S.A., Khafagy, M.H., Omara, F.A.: JOUM: an indexing methodology for improving join in hive star schema. Int. J. Sci. Eng. Res. 6, 111–119 (2015)
Google Scholar
Abdel Azez, H.S., Khafagy, M.H., Omara, F.A.: Optimizing join in HIVE Star Schema using key/facts indexing. IETE Tech. Rev. 1–12 (2017)
Sahal, R., Khafagy, M.H., Omara, F.A.: Exploiting coarse-grained reused-based opportunities in big data multi-query optimization. J. Comput. Sci., forthcoming (2017)
Mishra, P., Eich, M.H.: Join processing in relational databases. ACM Comput. Surv. (CSUR) 24, 63–113 (1992)
Article Google Scholar
Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using MapReduce. Proc. VLDB Endow. 5, 1184–1195 (2012)
Article Google Scholar
Khafagy, M.H.: Index to index two-way join algorithm. Int. J. Digit. Content Technol. Appl. 9, 25 (2015)
Google Scholar
Blanas, S., Patel, J.M., Ercegovac, V., Rao, J., Shekita, E.J., Tian, Y.: A comparison of join algorithms for log processing in mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 975–986 (2010)
Psaroudakis, I., Athanassoulis, M., Ailamaki, A.: Sharing data and work across concurrent analytical queries. Proc. VLDB Endow. 6, 637–648 (2013)
Article Google Scholar
Dinh, T.T.A., Wenqiang, W., Datta, A.: City on the sky: extending XACML for flexible, secure data sharing on the Cloud. J. Grid Comput. 10, 151–172 (2012)
Article Google Scholar
Strohbach, M., Daubert, J., Ravkin, H., Lischka, M.: Big Data storage. In: New Horizons for a Data-Driven Economy, pp. 119–141. Springer, Berlin (2016)
Kambatla, K., Chen, Y.: The truth about mapreduce performance on ssds. In: 28th Large Installation System Administration Conference (LISA14), pp. 109–118 (2014)
GSP Parser. Available: http://www.sqlparser.com. (2002, Accessed: 24 May 2015, 11:30 pm)
Big data analytics. Available: http://www.webopedia.com/TERM/B/big_data_analytics.html. (Accessed: 12 Feb 2015)
Qin, C., Rusu, F.: PF-OLA: a high-performance framework for parallel online aggregation. Distrib. Parallel Databases 32, 337–375 (2014)
Article Google Scholar
Doulkeridis, C., Nørvåg, K.: A survey of large-scale analytical query processing in MapReduce. VLDB J. 23, 355–380 (2014)
Article Google Scholar
Kaufmann, M., Fischer, P.M., May, N., Kossmann, D.: Benchmarking Databases with History Support. Technical Report, SAP AG (2013)
Liu, H., Xiao, D., Didwania, P., Eltabakh, M.Y.: Exploiting soft and hard correlations in big data query optimization. Proc. VLDB Endow. 9, 1005–1016 (2016)
Article Google Scholar

Download references

Author information

Authors and Affiliations

Faculty of Computers and Information, Cairo University, Cairo, Egypt
Radhya Sahal & Fatma A. Omara
College of Sciences, University of Kirkuk, Kirkuk, Iraq
Marwah Nihad
Faculty of Computers and Information, Fayoum University, Al Fayoum, Egypt
Mohamed H. Khafagy

Authors

Radhya Sahal
View author publications
You can also search for this author in PubMed Google Scholar
Marwah Nihad
View author publications
You can also search for this author in PubMed Google Scholar
Mohamed H. Khafagy
View author publications
You can also search for this author in PubMed Google Scholar
Fatma A. Omara
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Radhya Sahal.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Sahal, R., Nihad, M., Khafagy, M.H. et al. iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage. J Grid Computing 16, 345–380 (2018). https://doi.org/10.1007/s10723-018-9431-9

Download citation

Received: 24 June 2017
Accepted: 01 February 2018
Published: 06 March 2018
Issue Date: June 2018
DOI: https://doi.org/10.1007/s10723-018-9431-9

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Abstract

Access this article

Similar content being viewed by others

Efficient query processing framework for big data warehouse: an almost join-free approach

GPU-based efficient join algorithms on Hadoop

Improved join operations using ORC in HIVE

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

iHOME: Index-Based JOIN Query Optimization for Limited Big Data Storage

Abstract

Access this article

Similar content being viewed by others

Efficient query processing framework for big data warehouse: an almost join-free approach

GPU-based efficient join algorithms on Hadoop

Improved join operations using ORC in HIVE

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation