Improving Hadoop Hive Query Response Times Through Efficient Virtual Resource Allocation

Dokeroglu, Tansel; Cınar, Muhammet Serkan; Sert, Seyyit Alper; Cosar, Ahmet; Yazıcı, Adnan

doi:10.1007/978-3-319-26154-6_17

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 400)

Abstract

The performance of MapReduce-based Cloud data warehouses depends mainly on the virtual hardware resources allocated to them. Most of the time, these resources are set to values selected by the Cloud service providers. However, setting the virtual resources, such as the number of CPUs, the size of RAM, and the network bandwidth, in accordance with the workload demands of a query will improve the response time when querying large data on an optimized system. In this study, we carried out a set of experiments with a well-known MapReduce SQL translator, Hadoop Hive, on the TPC-H decision support benchmark database in order to analyze the performance sensitivity of the queries under different virtual resource settings. Our results provide valuable hints for decision makers who design efficient MapReduce-based data warehouses on the Cloud.

Author information

Authors and Affiliations

Computer Engineering Department of Middle East Technical University, Universities Street, 6800, Cankaya, Ankara, Turkey
Tansel Dokeroglu, Muhammet Serkan Cınar, Seyyit Alper Sert, Ahmet Cosar & Adnan Yazıcı

Corresponding author

Correspondence to Tansel Dokeroglu.

Editor information

Editors and Affiliations

Dept. of Comm., Business & Info. Tech., Roskilde University, Roskilde, Denmark
Troels Andreasen
Dept. Computer Science, Roskilde University, Roskilde, Denmark
Henning Christiansen
Systems Research Institute of the Polish Academy of Sciences, Intelligent Systems Laboratory, Warsaw, Poland
Janusz Kacprzyk
Department of Electronic Systems, Aalborg University, Esbjerg, Denmark
Henrik Larsen
Computer Science Department, University of Milano-Bicocca, Milano, Italy
Gabriella Pasi
IRISA, University of Rennes I, Rennes, France
Olivier Pivert
Dept. of Tele. & Info. Processing, Ghent University, Gent, Belgium
Guy De Tré
Dept. of Comp. & Art. Int. Sci., University of Granada, Granada, Spain
Maria Amparo Vila
Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
Adnan Yazıcı
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Sławomir Zadrożny

About this paper

Cite this paper

Dokeroglu, T., Cınar, M.S., Sert, S.A., Cosar, A., Yazıcı, A. (2016). Improving Hadoop Hive Query Response Times Through Efficient Virtual Resource Allocation. In: Andreasen, T., et al. Flexible Query Answering Systems 2015. Advances in Intelligent Systems and Computing, vol 400. Springer, Cham. https://doi.org/10.1007/978-3-319-26154-6_17

DOI: https://doi.org/10.1007/978-3-319-26154-6_17
Published: 21 October 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26153-9
Online ISBN: 978-3-319-26154-6
eBook Packages: Computer Science (R0)
