ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results

Munir, Rana Faisal; Romero, Oscar; Abelló, Alberto; Bilalli, Besim; Thiele, Maik; Lehner, Wolfgang

doi:10.1007/978-3-319-45547-1_4

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9893)

Included in the following conference series:

International Conference on Model and Data Engineering

Abstract

Large-scale data analysis is an important activity in many organizations that typically requires the deployment of data-intensive workflows. As data is processed, these workflows generate large intermediate results, which are typically pipelined from one operator to the next. However, if materialized, these results become reusable, so subsequent workflows need not recompute them. There are already many solutions that materialize intermediate results, but all of them assume a fixed data format. A fixed format, however, may not be the optimal one for every situation. For example, it is well known that different data fragmentation strategies (e.g., horizontal and vertical) behave better or worse according to the access patterns of the subsequent operations. In this paper, we present ResilientStore, which assists in selecting the most appropriate data format for materializing intermediate results. Given a workflow and a set of materialization points, it uses rule-based heuristics to choose the best storage data format based on subsequent access patterns. We have implemented ResilientStore for HDFS and three different data formats: SequenceFile, Parquet and Avro. Experimental results show that our solution gives 18% better performance than any solution based on a single fixed format.
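
To make the idea of rule-based format selection concrete, the following is a minimal sketch and not the rule set used by ResilientStore, which the abstract does not spell out. The names `AccessPattern` and `choose_format`, the 0.5 projection threshold, and the rules themselves are illustrative assumptions. They rest only on the general observation that a columnar (vertical) layout such as Parquet tends to favor operations reading few columns, while row-oriented (horizontal) layouts such as Avro or SequenceFile tend to favor operations reading whole rows.

```python
from dataclasses import dataclass
from enum import Enum


class Format(Enum):
    SEQUENCE_FILE = "SequenceFile"
    AVRO = "Avro"
    PARQUET = "Parquet"


@dataclass(frozen=True)
class AccessPattern:
    """How one subsequent operator reads a materialized result (hypothetical model)."""
    columns_read: int    # columns the operator actually touches
    total_columns: int   # columns available in the intermediate result


def choose_format(patterns, projection_threshold=0.5):
    """Pick a storage format from the access patterns of subsequent operators.

    The rules below are assumptions made for illustration only:
      1. Results with at most two columns are treated as plain key/value
         pairs and stored as SequenceFile.
      2. If the operators read, on average, less than `projection_threshold`
         of the columns, a columnar layout (Parquet) is chosen.
      3. Otherwise a row-oriented layout (Avro) is chosen.
    """
    if not patterns:
        raise ValueError("at least one subsequent access pattern is required")

    # Rule 1: narrow key/value results.
    if patterns[0].total_columns <= 2:
        return Format.SEQUENCE_FILE

    # Average fraction of columns read across all subsequent operators.
    mean_fraction = sum(
        p.columns_read / p.total_columns for p in patterns
    ) / len(patterns)

    # Rule 2: projection-heavy access favors a vertical layout.
    if mean_fraction < projection_threshold:
        return Format.PARQUET

    # Rule 3: full-row access favors a horizontal layout.
    return Format.AVRO


if __name__ == "__main__":
    consumers = [AccessPattern(2, 16), AccessPattern(3, 16)]
    print(choose_format(consumers).value)
```

In this sketch, a materialization point whose consumers project 2 and 3 of 16 columns would be stored as Parquet, whereas a consumer scanning all 16 columns would lead to Avro. A real selector would need rules grounded in measurements for each format.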

Notes

1. https://hadoop.apache.org.
2. http://hive.apache.org.
3. http://pig.apache.org.
4. https://orc.apache.org.
5. http://avro.apache.org.
6. http://parquet.apache.org.
7. http://wiki.apache.org/hadoop/SequenceFile.
8. http://www.svds.com/how-to-choose-a-data-format.
9. http://pig.apache.org/docs/r0.9.1/zebra_pig.html.
10. http://www.tpc.org/tpch.
11. A Pig operation combining GROUP BY and JOIN.
12. http://www.ac.upc.edu/serveis-tic/altas-prestaciones.
13. http://www.tpc.org/tpch.
14. http://ranafaisal.info/?attachment_id=153.

Acknowledgments

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).

Author information

Authors and Affiliations

Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
Rana Faisal Munir, Oscar Romero, Alberto Abelló & Besim Bilalli
Technische Universität Dresden (TUD), Dresden, Germany
Maik Thiele & Wolfgang Lehner

Corresponding author

Correspondence to Rana Faisal Munir.

Editor information

Editors and Affiliations

LIAS/ISAE-ENSMA, Futuroscope Chasseneuil, France
Ladjel Bellatreche
Department of Information Systems and Computation, Universitat Politècnica de València, Valencia, Spain
Óscar Pastor
University of Almería, Almería, Spain
Jesús M. Almendros Jiménez
IRIT / ENSEIHT, Toulouse, France
Yamine Aït-Ameur

About this paper

Cite this paper

Munir, R.F., Romero, O., Abelló, A., Bilalli, B., Thiele, M., Lehner, W. (2016). ResilientStore: A Heuristic-Based Data Format Selector for Intermediate Results. In: Bellatreche, L., Pastor, Ó., Almendros Jiménez, J., Aït-Ameur, Y. (eds) Model and Data Engineering. MEDI 2016. Lecture Notes in Computer Science, vol 9893. Springer, Cham. https://doi.org/10.1007/978-3-319-45547-1_4

DOI: https://doi.org/10.1007/978-3-319-45547-1_4
Published: 07 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45546-4
Online ISBN: 978-3-319-45547-1
eBook Packages: Computer Science, Computer Science (R0)
