Skip to main content

Workflow Systems for Big Data Analysis

  • Living reference work entry
  • Latest version View entry history
  • First Online:
Encyclopedia of Big Data Technologies

Definitions

A workflow is a well-defined, and possibly repeatable, pattern or systematic organization of activities designed to achieve a certain transformation of data (Talia et al. 2015).

A Workflow Management System (WMS) is a software environment providing tools to define, compose, map, and execute workflows.

Overview

The wide availability of high-performance computing systems has allowed scientists and engineers to implement more and more complex applications for accessing and analyzing huge amounts of data (Big Data) on distributed and high-performance computing platforms. Given the variety of Big Data applications and types of users (from end users to skilled programmers), there is a need for scalable programming models with different levels of abstractions and design formalisms. The programming models should adapt to user needs by allowing (i) ease in defining data analysis applications, (ii) effectiveness in the analysis of large datasets, (iii) and efficiency of executing...

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Institutional subscriptions

References

  • Agapito G, Cannataro M, Guzzi PH, Marozzo F, Talia D, Trunfio P (2013) Cloud4snp: distributed analysis of snp microarray data on the cloud. In: Proceedings of the ACM conference on bioinformatics, computational biology and biomedical informatics 2013 (ACM BCB 2013). ACM, Washington, DC, p 468. ISBN:978-1-4503-2434-2

    Google Scholar 

  • Altomare A, Cesario E, Comito C, Marozzo F, Talia D (2017) Trajectory pattern mining for urban computing in the cloud. Trans Parallel Distrib Syst 28(2):586–599. ISSN:1045-9219

    Google Scholar 

  • Atay M, Chebotko A, Liu D, Lu S, Fotouhi F (2007) Efficient schema-based XML-to-relational data mapping. Inf Syst 32(3):458–476

    Google Scholar 

  • Belcastro L, Marozzo F, Talia D, Trunfio P (2015) Programming visual and script-based big data analytics workflows on clouds. In: Grandinetti L, Joubert G, Kunze M, Pascucci V (eds) Post-proceedings of the high performance computing workshop 2014. Advances in parallel computing, vol 26. IOS Press, Cetraro, pp 18–31. ISBN:978-1-61499-582-1

    Google Scholar 

  • Belcastro L, Marozzo F, Talia D, Trunfio P (2016) Using scalable data mining for predicting flight delays. ACM Trans Intell Syst Technol 8(1):1–20

    Google Scholar 

  • Bowers S, Ludascher B, Ngu AHH, Critchlow T (2006) Enabling scientific workflow reuse through structured composition of dataflow and control-flow. In: 22nd international conference on data engineering workshops (ICDEW’06), pp 70–70. https://doi.org/10.1109/ICDEW.2006.55

  • Brown DA, Brady PR, Dietz A, Cao J, Johnson B, McNabb J (2007) A case study on the use of workflow technologies for scientific analysis: gravitational wave data analysis. Workflows for e-Science, pp 39–59

    Google Scholar 

  • Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

    Article  Google Scholar 

  • Deelman E, Gannon D, Shields M, Taylor I (2009) Workflows and e-science: an overview of workflow system features and capabilities. Futur Gener Comput Syst 25(5):528–540

    Article  Google Scholar 

  • Deelman E, Vahi K, Juve G, Rynge M, Callaghan S, Maechling PJ, Mayani R, Chen W, da Silva RF, Livny M et al (2015) Pegasus, a workflow management system for science automation. Futur Gener Comput Syst 46:17–35

    Article  Google Scholar 

  • Georgakopoulos D, Hornick M, Sheth A (1995) An overview of workflow management: from process modeling to workflow automation infrastructure. Distrib Parallel Databases 3(2):119–153

    Article  Google Scholar 

  • Gropp W, Lusk E, Skjellum A (1999) Using MPI: portable parallel programming with the message-passing interface, vol 1. MIT press, Cambridge

    MATH  Google Scholar 

  • Guan Z, Hernandez F, Bangalore P, Gray J, Skjellum A, Velusamy V, Liu Y (2006) Grid-flow: a grid-enabled scientific workflow system with a petri-net-based interface. Concurr Comput Pract Exp 18(10):1115–1140

    Article  Google Scholar 

  • Isard M, Budiu M, Yu Y, Birrell A, Fetterly D (2007) Dryad: distributed data-parallel programs from sequential building blocks. In: ACM SIGOPS operating systems review, vol 41. ACM, pp 59–72

    Google Scholar 

  • Juric MB, Mathew B, Sarang PG (2006) Business process execution language for web services: an architect and developer’s guide to orchestrating web services using BPEL4WS. Packt Publishing Ltd, Birmingham

    Google Scholar 

  • Juve G, Deelman E, Vahi K, Mehta G, Berriman B, Berman BP, Maechling P (2009) Scientific workflow applications on Amazon EC2. In: 2009 5th IEEE international conference on E-science workshops. IEEE, pp 59–66

    Google Scholar 

  • Kranjc J, Podpečan V, Lavrač N (2012) Clowdflows: a cloud based scientific workflow platform. In: Machine learning and knowledge discovery in databases. Springer, pp 816–819

    Google Scholar 

  • Lee S, Park H, Shin Y (2012) Cloud computing availability: multi-clouds for big data service. In: Convergence and hybrid information technology. Springer, Heidelberg, pp 799–806

    Chapter  Google Scholar 

  • Liu L, Pu C, Ruiz DD (2004) A systematic approach to flexible specification, composition, and restructuring of workflow activities. J Database Manag 15(1):1

    Article  Google Scholar 

  • Lordan F, Tejedor E, Ejarque J, Rafanell R, Álvarez J, Marozzo F, Lezzi D, Sirvent R, Talia D, Badia R (2014) Servicess: an interoperable programming framework for the cloud. J Grid Comput 12(1):67–91

    Article  Google Scholar 

  • Lu Q, Hao P, Curcin V, He W, Li YY, Luo QM, Guo YK, Li YX (2006) KDE bioscience: platform for bioinformatics analysis workflows. J Biomed Inf 39(4):440–450

    Article  Google Scholar 

  • Ludäscher B, Altintas I, Berkley C, Higgins D, Jaeger E, Jones M, Lee EA, Tao J, Zhao Y (2006) Scientific workflow management and the kepler system. Concurr Comput Pract Exp 18(10):1039–1065

    Article  Google Scholar 

  • Maheshwari K, Rodriguez A, Kelly D, Madduri R, Wozniak J, Wilde M, Foster I (2013) Enabling multi-task computation on galaxy-based gateways using swift. In: 2013 IEEE international conference on cluster computing (CLUSTER). IEEE, pp 1–3

    Google Scholar 

  • Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G (2010) Pregel: a system for large-scale graph processing. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 135–146

    Google Scholar 

  • Marin A, Wellman B (2011) Social network analysis: an introduction. The SAGE handbook of social network analysis, p 11. Sage Publications, Thousand Oaks

    Google Scholar 

  • Margolis B (2007). SOA for the business developer: concepts, BPEL, and SCA. Mc Press, Lewisville

    Google Scholar 

  • Marozzo F, Talia D, Trunfio P (2011) A cloud framework for parameter sweeping data mining applications. In: Proceedings of the 3rd IEEE international conference on cloud computing technology and science (CloudCom’11). IEEE Computer Society Press, Athens, pp 367–374. ISBN:978-0-7695-4622-3

    Chapter  Google Scholar 

  • Marozzo F, Talia D, Trunfio P (2015) Js4cloud: script-based workflow programming for scalable data analysis on cloud platforms. Concurr Comput Pract Exp 27(17):5214–5237

    Article  Google Scholar 

  • Marozzo F, Talia D, Trunfio P (2016) A workflow management system for scalable data mining on clouds. IEEE Trans Serv Comput PP(99):1–1

    Google Scholar 

  • Talia D, Trunfio P, Marozzo F (2015) Data analysis in the cloud. Elsevier. ISBN:978-0-12-802881-0

    Google Scholar 

  • Valiant LG (1990) A bridging model for parallel computation. Commun ACM 33(8):103–111

    Article  Google Scholar 

  • WFMC T (1999) Glossary, document number WFMC, issue 3.0. TC 1011

    Google Scholar 

  • Wilde M, Hategan M, Wozniak JM, Clifford B, Katz DS, Foster I (2011) Swift: a language for distributed parallel scripting. Parallel Comput 37(9):633–652

    Article  Google Scholar 

  • Wolstencroft K, Haines R, Fellows D, Williams A, Withers D, Owen S, Soiland-Reyes S, Dunlop I, Nenadic A, Fisher P et al (2013) The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud. Nucleic Acids Res 41(W1):W557–W561

    Article  Google Scholar 

  • Wozniak JM, Wilde M, Foster IT (2014) Language features for scalable distributed-memory dataflow computing. In: 2014 fourth workshop on data-flow execution models for extreme scale computing (DFM). IEEE, pp 50–53

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Loris Belcastro .

Editor information

Editors and Affiliations

Section Editor information

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer International Publishing AG

About this entry

Check for updates. Verify currency and authenticity via CrossMark

Cite this entry

Belcastro, L., Marozzo, F. (2018). Workflow Systems for Big Data Analysis. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_137-1

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_137-1

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference MathematicsReference Module Computer Science and Engineering

Publish with us

Policies and ethics

Chapter history

  1. Latest

    Workflow Systems for Big Data Analysis
    Published:
    14 February 2018

    DOI: https://doi.org/10.1007/978-3-319-63962-8_137-1

  2. Original

    Workflow Systems for Big Data Analysis
    Published:
    24 February 2012

    DOI: https://doi.org/10.1007/978-3-319-63962-8_137-2