Eeny Meeny Miny Moe: Choosing the Fault Tolerance Technique for my Cloud Workflow

de Jesus, Leonardo Araújo; Drummond, Lúcia M. A.; de Oliveira, Daniel

doi:10.1007/978-3-319-73353-1_23

Part of the book series: Communications in Computer and Information Science (CCIS, volume 796)

Included in the following conference series:

Latin American High Performance Computing Conference

Abstract

Scientific workflows are models composed of activities, data, and dependencies whose objective is to represent a computer simulation. Workflows are managed by Scientific Workflow Management Systems (SWfMSs). Such workflows commonly demand many computational resources, since their executions may involve a number of different programs processing a huge volume of data. Thus, High Performance Computing (HPC) environments combined with parallelization techniques provide the support needed to execute such experiments. Some resources provided by clouds can be used to build HPC environments. Although clouds offer advantages such as elasticity and availability, failures are a reality rather than a possibility. Thus, SWfMSs must be fault-tolerant. Several types of fault tolerance techniques are used in SWfMSs, such as checkpoint-restart and replication, but which fault tolerance technique best fits a specific workflow? This work aims at analyzing several fault tolerance techniques in SWfMSs and recommending a suitable one for the user's workflow using machine learning techniques and provenance data, thus improving resiliency.
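The recommendation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say which learner or which provenance attributes are used, so the example assumes a simple k-nearest-neighbour vote over two hypothetical features (mean activity runtime and observed failure rate). Every record, feature name, and function name below is invented for illustration.

```python
from collections import Counter
from math import dist

# Hypothetical provenance records (assumption, not data from the paper):
# (mean activity runtime in seconds, observed failure rate) -> technique
# that worked best in that past execution.
PROVENANCE = [
    ((900.0, 0.05), "checkpoint-restart"),
    ((1200.0, 0.08), "checkpoint-restart"),
    ((1500.0, 0.10), "checkpoint-restart"),
    ((60.0, 0.30), "replication"),
    ((90.0, 0.40), "replication"),
    ((120.0, 0.35), "replication"),
]


def make_scaler(records):
    """Build a min-max scaler so no single feature dominates the distance."""
    columns = list(zip(*(features for features, _ in records)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]

    def scale(features):
        return tuple(
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(features, lows, highs)
        )

    return scale


def recommend(features, records=PROVENANCE, k=3):
    """Return the majority technique among the k most similar past runs."""
    scale = make_scaler(records)
    target = scale(features)
    nearest = sorted(records, key=lambda rec: dist(scale(rec[0]), target))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # A long-running workflow with few failures, then a short, failure-prone one.
    print(recommend((1000.0, 0.06)))
    print(recommend((100.0, 0.38)))
```

In a real setting the records would come from the SWfMS provenance database, and the learner could be replaced by any classifier trained on those records.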

Author information

Authors and Affiliations

Instituto de Computação, Universidade Federal Fluminense (UFF), Niterói, Brazil
Leonardo Araújo de Jesus, Lúcia M. A. Drummond & Daniel de Oliveira

Corresponding author

Correspondence to Daniel de Oliveira.

Editor information

Editors and Affiliations

CSC-CONICET and Universidad de Buenos Aires, Buenos Aires, Argentina
Esteban Mocskos
Universidad de la República, Montevideo, Uruguay
Sergio Nesmachnow

About this paper

Cite this paper

de Jesus, L.A., Drummond, L.M.A., de Oliveira, D. (2018). Eeny Meeny Miny Moe: Choosing the Fault Tolerance Technique for my Cloud Workflow. In: Mocskos, E., Nesmachnow, S. (eds) High Performance Computing. CARLA 2017. Communications in Computer and Information Science, vol 796. Springer, Cham. https://doi.org/10.1007/978-3-319-73353-1_23

DOI: https://doi.org/10.1007/978-3-319-73353-1_23
Published: 28 December 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73352-4
Online ISBN: 978-3-319-73353-1
eBook Packages: Computer Science, Computer Science (R0)