SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop

Ramdane, Yassine; Kabachi, Nadia; Boussaid, Omar; Bentayeb, Fadila

doi:10.1007/978-3-030-27520-4_14

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11708)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery

Abstract

Horizontal partitioning techniques have been used for many purposes in big data processing, such as load balancing, skipping unnecessary data loads, and guiding the physical design of a data warehouse. In big data warehouses, the most expensive operation of an OLAP query is the star join, which requires many Spark stages. In this paper, we propose a new data placement strategy for the Apache Hadoop environment, called “Smart Data Warehouse Placement” (SDWP), which allows the star join operation to be performed in a single Spark stage. We investigate the problem of partitioning and load balancing in a cluster of homogeneous nodes, taking into account the characteristics of the cluster and the size of the data warehouse. With our approach, almost all operations of an OLAP query are executed in parallel during the first Spark stage, without a shuffle phase. Our experiments show that the proposed method improves OLAP query performance in terms of execution time.
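
To make the idea of a shuffle-free join concrete, the following is a minimal, library-free Python sketch of co-partitioning: fact rows and dimension rows that share a join key are placed in the same bucket, so each bucket can be joined locally. It is an illustration under assumptions, not the SDWP algorithm itself, whose details are not given in the abstract. The table contents, the names `bucket_of`, `co_partition`, and `local_join`, the modulo placement rule, and the use of a single dimension are all hypothetical.

```python
# Illustrative sketch only: generic co-partitioning, not the paper's SDWP method.

def bucket_of(key, num_buckets):
    # Assumed placement rule: the same integer key always maps to the same bucket.
    return key % num_buckets

def co_partition(fact_rows, dim_rows, key, num_buckets):
    # Each bucket holds its share of fact rows plus the matching dimension rows.
    buckets = [{"fact": [], "dim": {}} for _ in range(num_buckets)]
    for row in fact_rows:
        buckets[bucket_of(row[key], num_buckets)]["fact"].append(row)
    for row in dim_rows:
        buckets[bucket_of(row[key], num_buckets)]["dim"][row[key]] = row
    return buckets

def local_join(bucket, key):
    # Join inside one bucket; no row from another bucket is ever consulted.
    joined = []
    for fact in bucket["fact"]:
        dim = bucket["dim"].get(fact[key])
        if dim is not None:
            joined.append({**fact, **dim})
    return joined

# Hypothetical toy data: one fact table and one dimension table.
sales = [
    {"sale_id": 1, "customer_id": 10, "amount": 5.0},
    {"sale_id": 2, "customer_id": 11, "amount": 7.5},
    {"sale_id": 3, "customer_id": 10, "amount": 1.0},
    {"sale_id": 4, "customer_id": 12, "amount": 2.0},
]
customers = [
    {"customer_id": 10, "city": "Lyon"},
    {"customer_id": 11, "city": "Bron"},
    {"customer_id": 12, "city": "Villeurbanne"},
]

buckets = co_partition(sales, customers, "customer_id", num_buckets=2)
# Each bucket is joined independently, as parallel tasks on separate nodes would be.
joined = [row for b in buckets for row in local_join(b, "customer_id")]
```

Because matching rows are already co-located, every fact row finds its dimension row inside its own bucket. A real star join involves several dimension tables, which this single-key sketch does not address.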

Notes

1. A straggler is a task that performs more poorly than similar ones due to insufficient assigned resources.
2. \(MSE=\sum_{j=1}^{k}\sum_{X_i\in C_j} \frac{\Vert X_i-C_j\Vert^{2}}{n}\), where \(X_i\) denotes the data point locations, i.e., tuples or vectors of the matrix MV, \(C_j\) denotes the centroid locations, and \(n=|MV|\).
3. Available from https://github.com/databricks/spark-sql-perf.
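
The MSE defined in note 2 can be illustrated with a short Python sketch. It assumes each point has already been assigned to a cluster; the function name `mse`, the `labels` argument, and the sample points are hypothetical and are not taken from the paper.

```python
def mse(points, centroids, labels):
    """Mean squared error of a clustering, following the formula in note 2.

    points    -- list of equal-length numeric tuples (the rows of MV)
    centroids -- list of centroid tuples, one per cluster
    labels    -- labels[i] is the index of the cluster that points[i] belongs to
    """
    n = len(points)  # n = |MV|
    total = 0.0
    for x, j in zip(points, labels):
        c = centroids[j]
        # Squared Euclidean distance between the point and its cluster centroid.
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / n

# Hypothetical example: two clusters in the plane.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
centroids = [(1.0, 0.0), (10.0, 10.0)]
labels = [0, 0, 1]
error = mse(points, centroids, labels)
```

Here the two points in the first cluster each lie at squared distance 1 from their centroid and the third point coincides with its centroid, so the sum of squared distances is 2 and it is divided by n = 3.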

Author information

Authors and Affiliations

University of Lyon, Lyon 2, ERIC EA 3083, 5, avenue Pierre Mendes, 69676, Bron-CEDEX, France
Yassine Ramdane, Omar Boussaid & Fadila Bentayeb
University of Lyon, University Claude Bernard Lyon 1, ERIC EA 3083, 43, boulevard du 11 novembre 1918, 69100, Villeurbanne, France
Nadia Kabachi

Corresponding authors

Correspondence to Yassine Ramdane, Nadia Kabachi, Omar Boussaid or Fadila Bentayeb.

Editor information

Editors and Affiliations

University of Houston, Houston, TX, USA
Carlos Ordonez
Drexel University, Philadelphia, PA, USA
Il-Yeol Song
Johannes Kepler University of Linz, Linz, Austria
Gabriele Anderst-Kotsis
Software Competence Center Hagenberg, Hagenberg im Mühlkreis, Austria
A Min Tjoa
Johannes Kepler University of Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Ramdane, Y., Kabachi, N., Boussaid, O., Bentayeb, F. (2019). SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop. In: Ordonez, C., Song, IY., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_14

DOI: https://doi.org/10.1007/978-3-030-27520-4_14
Published: 03 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27519-8
Online ISBN: 978-3-030-27520-4
eBook Packages: Computer Science, Computer Science (R0)
