Abstract
Workflows are used to orchestrate data-intensive applications in many different scientific domains. Workflow applications typically communicate data between processing steps using intermediate files. When tasks are distributed, these files are either transferred from one computational node to another, or accessed through a shared storage system. As a result, the efficient management of data is a key factor in achieving good performance for workflow applications in distributed environments. In this paper we investigate some of the ways in which data can be managed for workflows in the cloud. We ran experiments using three typical workflow applications on Amazon’s EC2 cloud computing platform. We discuss the various storage and file systems we used, describe the issues and problems we encountered deploying them on EC2, and analyze the resulting performance and cost of the workflows.
Cite this article
Juve, G., Deelman, E., Berriman, G.B. et al. An Evaluation of the Cost and Performance of Scientific Workflows on Amazon EC2. J Grid Computing 10, 5–21 (2012). https://doi.org/10.1007/s10723-012-9207-6
DOI: https://doi.org/10.1007/s10723-012-9207-6