DOI: 10.1145/3274005.3274021
research-article

Big Data Technologies on Commodity Workstations: A Basic Setup for Apache Impala

Published: 13 September 2018

Abstract

Big Data technologies brought the idea of parallel processing on cheaper commodity servers. When dealing with huge amounts of data, instead of migrating to more performant and costly hardware platforms, or buying resources in the cloud, it is more affordable to add a number of cheaper servers as nodes for data processing and/or storage. NoSQL data stores, Hadoop ecosystems, and NewSQL platforms have proved viable for Big Data storage and processing. In this paper we were concerned with setting up a platform for big data processing using commodity workstations. Many small and medium-sized companies have limited resources, and their workstations remain unused for more than 12 hours a day. Here Beowulf Cluster Computing could prove useful. Apache Impala was installed as part of a Hadoop distribution on a 9-node cluster. Three TPC-H database schemas were loaded, for scale factors of 1, 2 and 10 GB. A series of 100 SQL queries was randomly generated and executed for each scale factor. Results were collected and analyzed to determine whether the cluster can provide a decent level of data processing performance.
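The abstract does not say how the 100 queries per scale factor were generated. As a rough illustration only, the following Python sketch builds a reproducible random workload of simple aggregate queries over a few TPC-H tables. The database names (`tpch_sf1`, `tpch_sf2`, `tpch_sf10`), the query template, the threshold range, and the seed are assumptions made for this sketch and are not taken from the paper; only the scale factors, the query count, and the TPC-H table and column names come from the abstract or the TPC-H specification.

```python
import random

# Numeric columns from the standard TPC-H schema; the selection is illustrative.
NUMERIC_COLUMNS = {
    "lineitem": ["l_quantity", "l_extendedprice"],
    "orders": ["o_totalprice"],
    "customer": ["c_acctbal"],
    "part": ["p_retailprice"],
    "supplier": ["s_acctbal"],
}

SCALE_FACTORS = [1, 2, 10]        # scale factors named in the abstract
QUERIES_PER_SCALE_FACTOR = 100    # query count named in the abstract


def random_query(rng, database):
    """Build one aggregate query with a random table, column, and threshold.

    The template is a hypothetical stand-in, not the generator used in the paper.
    """
    table = rng.choice(sorted(NUMERIC_COLUMNS))
    column = rng.choice(NUMERIC_COLUMNS[table])
    aggregate = rng.choice(["COUNT", "MIN", "MAX", "AVG"])
    threshold = rng.randint(0, 1000)  # assumed range, chosen for illustration
    return (
        f"SELECT {aggregate}({column}) FROM {database}.{table} "
        f"WHERE {column} > {threshold}"
    )


def build_workload(seed=42):
    """Return {scale_factor: [query, ...]} drawn from one seeded random stream."""
    rng = random.Random(seed)
    workload = {}
    for sf in SCALE_FACTORS:
        database = f"tpch_sf{sf}"  # hypothetical database naming
        workload[sf] = [
            random_query(rng, database)
            for _ in range(QUERIES_PER_SCALE_FACTOR)
        ]
    return workload


if __name__ == "__main__":
    for sf, queries in build_workload().items():
        print(sf, len(queries), queries[0])
```

Seeding the generator keeps the workload identical across runs, so timings from repeated executions stay comparable. Submitting the queries to Impala and recording execution times would need a client connection, which this sketch leaves out.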

    Published In

    CompSysTech '18: Proceedings of the 19th International Conference on Computer Systems and Technologies
    September 2018
    206 pages
    ISBN:9781450364256
    DOI:10.1145/3274005

    In-Cooperation

    • ERSVB: EURORISC SYSTEMS - Varna, Bulgaria
    • FOSEUB: FEDERATION OF THE SCIENTIFIC ENGINEERING UNIONS - Bulgaria
    • UORB: University of Ruse, Bulgaria
    • TECHUVB: Technical University of Varna, Bulgaria

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Beowulf clustering
    2. Distributed computing
    3. Hadoop
    4. Impala
    5. Query performance

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    CompSysTech'18

    Acceptance Rates

    Overall Acceptance Rate 241 of 492 submissions, 49%
