Abstract
Running a dataflow in a cloud environment usually generates many useful intermediate datasets. A strategy for running a dataflow decides which of these datasets are stored; the rest are regenerated whenever they are needed. The intermediate dataset storage (IDS) problem asks for a strategy that minimizes the total cost. The best current algorithm for linear-structure IDS, where “linear-structure” means that the datasets in the dataflow form a pipeline, takes \(O(n^4)\) time. In this paper, we present a new algorithm for this problem that improves the time complexity to \(O(n^3)\), where \(n\) is the number of datasets in the pipeline.
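To make the objective concrete, the following is a minimal sketch of the linear-structure IDS problem solved by exhaustive search. It is not the algorithm of this paper, and it runs in exponential time. The cost model is an assumption, since the abstract does not state one: each dataset has a hypothetical generation cost (`gen_cost`, the cost of computing it from its direct predecessor), a storage cost per unit time (`store_rate`), and a usage frequency per unit time (`use_freq`). A deleted dataset is assumed to be regenerated from its nearest stored predecessor, or from the always-available input if there is none.

```python
from itertools import product


def total_cost_rate(gen_cost, store_rate, use_freq, stored):
    """Cost per unit time of one strategy under the assumed cost model."""
    total = 0.0
    regen = 0.0  # cost of regenerating the current dataset from the nearest stored predecessor
    for x, y, v, keep in zip(gen_cost, store_rate, use_freq, stored):
        if keep:
            total += y      # a stored dataset only pays its storage rate
            regen = 0.0     # later datasets can start from this one
        else:
            regen += x      # extend the chain of deleted datasets
            total += v * regen  # pay the regeneration cost once per use
    return total


def best_strategy(gen_cost, store_rate, use_freq):
    """Try all 2^n store/delete choices; suitable only for tiny pipelines."""
    n = len(gen_cost)
    candidates = product([False, True], repeat=n)
    return min(
        ((total_cost_rate(gen_cost, store_rate, use_freq, s), s) for s in candidates),
        key=lambda pair: pair[0],
    )


if __name__ == "__main__":
    cost, stored = best_strategy([10, 1, 10], [1, 5, 1], [1, 1, 1])
    print(cost, stored)
```

In this made-up three-dataset pipeline, the middle dataset is cheap to regenerate but expensive to store, so the cheapest strategy under the assumed model stores the first and third datasets and regenerates the second on demand.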
Acknowledgements
This paper is supported by the National Natural Science Foundation of China (grant 61472222) and the Natural Science Foundation of Shandong Province (grant ZR2012Z002).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Cheng, J., Zhu, D., Zhu, B. (2015). A New Algorithm for Intermediate Dataset Storage in a Cloud-Based Dataflow. In: Wang, J., Yap, C. (eds) Frontiers in Algorithmics. FAW 2015. Lecture Notes in Computer Science, vol 9130. Springer, Cham. https://doi.org/10.1007/978-3-319-19647-3_4
DOI: https://doi.org/10.1007/978-3-319-19647-3_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19646-6
Online ISBN: 978-3-319-19647-3
eBook Packages: Computer Science, Computer Science (R0)