SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop

  • Conference paper
Big Data Analytics and Knowledge Discovery (DaWaK 2019)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11708)

Abstract

Horizontal partitioning techniques have been used for many purposes in big data processing, such as load balancing, skipping unnecessary data loads, and guiding the physical design of a data warehouse. In big data warehouses, the most expensive operation of an OLAP query is the star join, which requires many Spark stages. In this paper, we propose a new data placement strategy for the Apache Hadoop environment, called “Smart Data Warehouse Placement (SDWP)”, which allows the star join operation to be performed in a single Spark stage. We investigate the problem of partitioning and load balancing in a cluster of homogeneous nodes, taking into account the characteristics of the cluster and the size of the data warehouse. With our approach, almost all operations of an OLAP query are executed in parallel during the first Spark stage, without a shuffle phase. Our experiments show that the proposed method improves OLAP query performance in terms of execution time.
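The abstract does not describe how SDWP places data, so the following is only a minimal sketch of the general idea behind a shuffle-free join, not the SDWP algorithm itself. It assumes a single dimension table, integer join keys, and a simple modulo placement rule; the names `partition_by_key`, `local_join`, `date_id`, `amount`, and `year` are hypothetical. When fact and dimension rows that share a key are placed on the same node, each node can join its own partitions without exchanging data.

```python
from collections import defaultdict


def partition_by_key(rows, key, num_nodes):
    """Assign each row to a node using the join key (assumed modulo rule)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key] % num_nodes].append(row)
    return parts


def local_join(fact_part, dim_part, key):
    """Hash join between two partitions that live on the same node."""
    index = {d[key]: d for d in dim_part}
    return [{**f, **index[f[key]]} for f in fact_part if f[key] in index]


# Hypothetical toy data: one fact table and one dimension table.
fact = [
    {"date_id": 1, "amount": 10},
    {"date_id": 2, "amount": 20},
    {"date_id": 3, "amount": 30},
]
dim = [
    {"date_id": 1, "year": 2018},
    {"date_id": 2, "year": 2019},
    {"date_id": 3, "year": 2019},
]

num_nodes = 2
fact_parts = partition_by_key(fact, "date_id", num_nodes)
dim_parts = partition_by_key(dim, "date_id", num_nodes)

# Each node joins only its own partitions; no rows move between nodes.
joined = []
for node in range(num_nodes):
    joined.extend(local_join(fact_parts[node], dim_parts[node], "date_id"))
```

Because both tables use the same placement rule, every fact row finds its matching dimension row on its own node. A real star join involves several dimensions, which this single-key sketch does not address.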

Notes

  1. A straggler is a task that performs more poorly than similar ones due to insufficient assigned resources.

  2. \(MSE=\sum_{j=1}^{k}\sum_{X_i\in C_j} \frac{\Vert X_i-C_j\Vert^{2}}{n}\), where \(X_i\) denotes the data point locations, i.e., tuples or vectors of the matrix MV, \(C_j\) denotes the centroid locations, and \(n=|MV|\).

  3. Available from the site https://github.com/databricks/spark-sql-perf.
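The MSE in note 2 can be computed directly from its definition. The sketch below is illustrative only: it assumes points and centroids are given as tuples of numbers and that `assignment[i]` holds the index of the cluster containing point `i`. The function name `mse` and the toy data are hypothetical and not taken from the paper.

```python
def mse(points, centroids, assignment):
    """Mean squared error of a clustering, following the formula in note 2.

    points     -- list of numeric tuples (the X_i)
    centroids  -- list of numeric tuples (the C_j)
    assignment -- assignment[i] is the cluster index j of points[i]
    """
    n = len(points)
    total = 0.0
    for x, j in zip(points, assignment):
        c = centroids[j]
        # Squared Euclidean distance ||X_i - C_j||^2
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / n


# Hypothetical toy data: four 2-D points in two clusters.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centroids = [(1, 0), (11, 0)]
assignment = [0, 0, 1, 1]
result = mse(points, centroids, assignment)
```

In this toy case every point lies at distance 1 from its centroid, so the four squared distances sum to 4 and the MSE is 4 / 4 = 1.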

Author information

Corresponding authors

Correspondence to Yassine Ramdane, Nadia Kabachi, Omar Boussaid or Fadila Bentayeb.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ramdane, Y., Kabachi, N., Boussaid, O., Bentayeb, F. (2019). SDWP: A New Data Placement Strategy for Distributed Big Data Warehouses in Hadoop. In: Ordonez, C., Song, IY., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2019. Lecture Notes in Computer Science, vol 11708. Springer, Cham. https://doi.org/10.1007/978-3-030-27520-4_14

  • DOI: https://doi.org/10.1007/978-3-030-27520-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27519-8

  • Online ISBN: 978-3-030-27520-4

  • eBook Packages: Computer Science, Computer Science (R0)
