Science in the Cloud: Allocation and Execution of Data-Intensive Scientific Workflows

Journal of Grid Computing

Abstract

An important challenge for the adoption of cloud computing in the scientific community remains the efficient allocation and execution of data-intensive scientific workflows so as to reduce execution time and the size of transferred data. The overhead of transferred data is becoming significant as emerging scientific workflows have input/output files and intermediate data products in the hundreds of gigabytes. The allocation of scientific workflows on public clouds can be described through a variety of perspectives and parameters, and has been proven to be NP-complete. This paper proposes an evolutionary approach for task allocation on public clouds that considers both data transfer and execution time. In our framework, a solution is represented by an allocation chromosome, which encodes the allocation of tasks to nodes, and an ordering chromosome, which defines the execution order according to the scientific workflow representation. We propose a multi-objective optimization that relies on a cloud cost model and employs tailored evolution operators. Starting from a population of possible solutions, we apply crossover and mutation operators to both chromosomes, aiming to optimize the data transferred between nodes as well as the total workflow runtime. The crossover operators combine parts of solutions to reduce data overhead, whereas the mutation operators swap parts of the same chromosome according to pre-defined rules. Our experimental study compares the proposed approach with current state-of-the-art approaches using synthetic and real-life workflows. Our algorithm performs similarly to existing heuristics for small workflows and shows improvements of up to 80% for larger synthetic workflows. To further validate our approach, we compare the allocation and scheduling it produces with those obtained by popular scientific workflow managers when real workflows with hundreds of tasks are executed on a public cloud. The results show a 10% improvement in runtime over existing schedulers, caused by an 80% reduction in transferred data and by optimized allocation and ordering of tasks. This improved data locality has a broader impact, as it can be used to study and improve data provenance and to facilitate data persistence for scientific workflows.
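The two-chromosome encoding described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Workflow` type, the helper names, the uniform `bandwidth` parameter, the simple makespan model, and the reject-if-invalid swap mutation are all assumptions introduced here for illustration, and the paper's cloud cost model and tailored operators are not reproduced.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    """Hypothetical workflow model: a DAG with task runtimes and edge data sizes."""
    runtime: dict  # task -> execution time
    edges: dict    # (parent, child) -> size of data passed from parent to child


def is_valid_order(wf, order):
    """An ordering chromosome is valid if every parent precedes its children."""
    pos = {task: i for i, task in enumerate(order)}
    return all(pos[p] < pos[c] for (p, c) in wf.edges)


def transferred_data(wf, allocation):
    """Objective 1: total data moved between tasks placed on different nodes."""
    return sum(size for (p, c), size in wf.edges.items()
               if allocation[p] != allocation[c])


def makespan(wf, allocation, order, bandwidth=1.0):
    """Objective 2: finish time when each node runs its tasks in the given order.

    Assumes one task at a time per node and a uniform inter-node bandwidth.
    """
    finish, node_free = {}, {}
    for task in order:
        node = allocation[task]
        ready = 0.0
        for (p, c), size in wf.edges.items():
            if c == task:
                delay = size / bandwidth if allocation[p] != node else 0.0
                ready = max(ready, finish[p] + delay)
        start = max(ready, node_free.get(node, 0.0))
        finish[task] = start + wf.runtime[task]
        node_free[node] = finish[task]
    return max(finish.values())


def mutate_order(wf, order, rng):
    """Assumed mutation: swap two positions, keep the result only if still valid."""
    i, j = rng.sample(range(len(order)), 2)
    child = list(order)
    child[i], child[j] = child[j], child[i]
    return child if is_valid_order(wf, child) else list(order)


# A four-task diamond workflow with made-up runtimes and data sizes.
wf = Workflow(
    runtime={"A": 2, "B": 3, "C": 4, "D": 1},
    edges={("A", "B"): 10, ("A", "C"): 20, ("B", "D"): 5, ("C", "D"): 5},
)
order = ["A", "B", "C", "D"]                 # ordering chromosome
single = {"A": 0, "B": 0, "C": 0, "D": 0}    # allocation chromosome: one node
split = {"A": 0, "B": 1, "C": 0, "D": 0}     # allocation chromosome: B on node 1

for name, alloc in (("single", single), ("split", split)):
    print(name, transferred_data(wf, alloc), makespan(wf, alloc, order, bandwidth=10.0))
```

In this toy case, moving task B to a second node shortens the runtime but introduces inter-node transfers, which is the kind of trade-off a multi-objective search has to balance.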

Author information

Corresponding author

Correspondence to Claudia Szabo.

About this article

Cite this article

Szabo, C., Sheng, Q.Z., Kroeger, T. et al. Science in the Cloud: Allocation and Execution of Data-Intensive Scientific Workflows. J Grid Computing 12, 245–264 (2014). https://doi.org/10.1007/s10723-013-9282-3

  • DOI: https://doi.org/10.1007/s10723-013-9282-3
