Abstract
Scientific workflows are models composed of activities, data, and dependencies that represent a computer simulation. They are managed by Scientific Workflow Management Systems (SWfMSs). Such workflows commonly demand many computational resources, since their executions may involve a number of different programs processing a huge volume of data. The use of High Performance Computing (HPC) environments, allied to parallelization techniques, therefore provides support for the execution of such experiments. Some resources provided by clouds can be used to build HPC environments. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility, so SWfMSs must be fault-tolerant. Several fault tolerance techniques are used in SWfMSs, such as checkpoint-restart and replication, but which technique best fits a specific workflow? This work analyzes several fault tolerance techniques in SWfMSs and recommends the most suitable one for the user’s workflow using machine learning techniques and provenance data, thus improving resiliency.
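To make the idea of recommending a technique from provenance data concrete, the following is a minimal sketch, not the authors' implementation. It assumes that past executions are summarized as records with hypothetical features (`avg_task_seconds`, `failure_rate`) and a label naming the technique that performed best; the records themselves are invented toy data. A one-level decision tree is used only as one possible learner.

```python
from collections import Counter

# Hypothetical provenance records: (features, technique that worked best in the past).
# Field names and values are illustrative only, not taken from the paper.
PROVENANCE = [
    ({"avg_task_seconds": 30.0, "failure_rate": 0.02}, "replication"),
    ({"avg_task_seconds": 45.0, "failure_rate": 0.10}, "replication"),
    ({"avg_task_seconds": 60.0, "failure_rate": 0.05}, "replication"),
    ({"avg_task_seconds": 900.0, "failure_rate": 0.03}, "checkpoint-restart"),
    ({"avg_task_seconds": 1800.0, "failure_rate": 0.08}, "checkpoint-restart"),
    ({"avg_task_seconds": 3600.0, "failure_rate": 0.01}, "checkpoint-restart"),
]


def majority(labels):
    """Return the most frequent label, or None for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else None


def train_stump(records):
    """Learn a one-level decision tree: (feature, threshold, left label, right label)."""
    best = None
    for feature in records[0][0]:
        values = sorted({feats[feature] for feats, _ in records})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [lab for feats, lab in records if feats[feature] <= threshold]
            right = [lab for feats, lab in records if feats[feature] > threshold]
            left_label, right_label = majority(left), majority(right)
            # Count training records misclassified by this split.
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct values in some feature")
    return best[1:]


def recommend(stump, features):
    """Apply the learned split to the features of a new workflow."""
    feature, threshold, left_label, right_label = stump
    return left_label if features[feature] <= threshold else right_label


stump = train_stump(PROVENANCE)
recommendation = recommend(stump, {"avg_task_seconds": 1200.0, "failure_rate": 0.04})
print(recommendation)
```

In practice the features would come from the SWfMS provenance database and the learner would likely be a full decision tree or rule learner; the stump only shows the shape of the train-then-recommend loop.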
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
de Jesus, L.A., Drummond, L.M.A., de Oliveira, D. (2018). Eeny Meeny Miny Moe: Choosing the Fault Tolerance Technique for my Cloud Workflow. In: Mocskos, E., Nesmachnow, S. (eds) High Performance Computing. CARLA 2017. Communications in Computer and Information Science, vol 796. Springer, Cham. https://doi.org/10.1007/978-3-319-73353-1_23
DOI: https://doi.org/10.1007/978-3-319-73353-1_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73352-4
Online ISBN: 978-3-319-73353-1
eBook Packages: Computer Science, Computer Science (R0)