Big Data Technologies on Commodity Workstations: A Basic Setup for Apache Impala

Authors:

Valerică Greavu-Şerban,

Ionuţ Hrubaru,

Alexandru Tică

CompSysTech '18: Proceedings of the 19th International Conference on Computer Systems and Technologies

Pages 110 - 115

https://doi.org/10.1145/3274005.3274021

Published: 13 September 2018

Abstract

Big Data technologies brought the idea of parallel processing on cheaper commodity servers. When dealing with huge amounts of data, instead of migrating to more powerful and costly hardware platforms or buying resources in the cloud, it is more affordable to add a number of cheaper servers as nodes for data processing and/or storage. NoSQL data stores, Hadoop ecosystems, and NewSQL platforms have proved viable for Big Data storage and processing. This paper is concerned with setting up a platform for big data processing using commodity workstations. Many small and medium-sized companies have limited resources, and their workstations remain unused for more than 12 hours a day. Here Beowulf cluster computing could prove useful. Apache Impala was installed as part of a Hadoop distribution on a 9-node cluster. Three TPC-H database schemas were loaded, for scale factors of 1, 2, and 10 GB. A series of 100 SQL queries was randomly generated and executed for each scale factor. Results were collected and analyzed to determine whether the cluster can provide a decent level of data processing performance.
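The abstract does not say how the 100 queries were generated or timed, so the following is only a minimal sketch of one way such a workload could be set up, not the authors' procedure. The schema names (`tpch_sf1`, `tpch_sf2`, `tpch_sf10`), the query shape, the seed, and the helper names `generate_queries` and `run_benchmark` are all assumptions made for illustration. Only the table and column names come from the TPC-H schema.

```python
import random
import time
from typing import Callable, Dict, List

# Hypothetical schema names, one per scale factor mentioned in the abstract.
SCHEMAS: Dict[int, str] = {1: "tpch_sf1", 2: "tpch_sf2", 10: "tpch_sf10"}

# A small subset of TPC-H tables and numeric columns (assumed query targets).
NUMERIC_COLUMNS: Dict[str, List[str]] = {
    "lineitem": ["l_quantity", "l_extendedprice", "l_discount"],
    "orders": ["o_totalprice"],
    "customer": ["c_acctbal"],
}


def generate_queries(n: int, seed: int = 42) -> List[str]:
    """Return n random query templates; a fixed seed makes the set repeatable."""
    rng = random.Random(seed)
    queries = []
    for _ in range(n):
        table = rng.choice(sorted(NUMERIC_COLUMNS))
        column = rng.choice(NUMERIC_COLUMNS[table])
        agg = rng.choice(["COUNT", "MIN", "MAX", "AVG"])
        threshold = rng.randint(1, 1000)
        # "{schema}" is left as a placeholder so the same query set
        # can be replayed against every scale factor.
        queries.append(
            f"SELECT {agg}({column}) FROM {{schema}}.{table} "
            f"WHERE {column} > {threshold}"
        )
    return queries


def run_benchmark(
    queries: List[str], schema: str, execute: Callable[[str], object]
) -> List[float]:
    """Run each query through `execute` and return wall-clock seconds per query."""
    timings = []
    for template in queries:
        sql = template.format(schema=schema)
        start = time.perf_counter()
        execute(sql)
        timings.append(time.perf_counter() - start)
    return timings
```

Here `execute` can be any callable that accepts a SQL string, for example the `execute` method of a database cursor connected to the cluster. Reusing one seeded query set across all three schemas keeps the scale factor as the only variable between runs.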

Index Terms

Big Data Technologies on Commodity Workstations: A Basic Setup for Apache Impala
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
      1. Retrieval efficiency

Published In

CompSysTech '18: Proceedings of the 19th International Conference on Computer Systems and Technologies

September 2018

206 pages

ISBN:9781450364256

DOI:10.1145/3274005

Copyright © 2018 ACM.

In-Cooperation

ERSVB: EURORISC SYSTEMS - Varna, Bulgaria
FOSEUB: FEDERATION OF THE SCIENTIFIC ENGINEERING UNIONS - Bulgaria
UORB: University of Ruse, Bulgaria
TECHUVB: Technical University of Varna, Bulgaria

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CompSysTech'18: 19th International Conference on Computer Systems and Technologies

September 13 - 14, 2018

Ruse, Bulgaria

Acceptance Rates

Overall Acceptance Rate 241 of 492 submissions, 49%
