An Evaluation of the Cost and Performance of Scientific Workflows on Amazon EC2

Journal of Grid Computing

Abstract

Workflows are used to orchestrate data-intensive applications in many different scientific domains. Workflow applications typically communicate data between processing steps using intermediate files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. As a result, the efficient management of data is a key factor in achieving good performance for workflow applications in distributed environments. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon’s EC2 cloud computing platform. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.

Author information

Corresponding author

Correspondence to Gideon Juve.

About this article

Cite this article

Juve, G., Deelman, E., Berriman, G.B. et al. An Evaluation of the Cost and Performance of Scientific Workflows on Amazon EC2. J Grid Computing 10, 5–21 (2012). https://doi.org/10.1007/s10723-012-9207-6

  • DOI: https://doi.org/10.1007/s10723-012-9207-6
