Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11019)

Abstract

In data analytics, researchers often work on the same datasets, investigating different aspects, and develop their programs incrementally. This opens opportunities to share and recycle results from previously executed jobs if they contain identical operations, e.g., restructuring, filtering, and other kinds of data preparation.

In this paper, we present an approach to accelerate the processing of such dataflow programs by materializing and recycling (intermediate) results in Apache Spark. We have implemented this idea in Piglet, our Pig Latin compiler for Spark, which transparently supports both merging multiple jobs and rewriting jobs to reuse intermediate results. We discuss the opportunities for recycling and present a profiling-based cost model as well as a decision model to identify potentially beneficial materialization points. Finally, we report the results of our experimental evaluation, which show the validity of the cost model and the benefit of recycling.

S. Hagedorn—This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/22.

Notes

  1. https://spark.apache.org/.

  2. https://github.com/dbis-ilm/piglet.

  3. In reality, the model has only one edge between the nodes. The multiple edges merely illustrate the different jobs.

  4. This requires, of course, that the clocks on all nodes are synchronized, for example via NTP.

  5. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

  6. http://www1.nyc.gov/site/planning/data-maps/open-data.page.

  7. https://www.gdeltproject.org/data.html.

Author information

Corresponding author

Correspondence to Stefan Hagedorn.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Hagedorn, S., Sattler, K.-U. (2018). Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_13

  • DOI: https://doi.org/10.1007/978-3-319-98398-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98397-4

  • Online ISBN: 978-3-319-98398-1

  • eBook Packages: Computer Science, Computer Science (R0)
