Research article. DOI: 10.1145/2443416.2443417

Makeflow: a portable abstraction for data intensive computing on clusters, clouds, and grids

Published: 20 May 2012

Abstract

In recent years, there has been renewed interest in languages and systems for large-scale distributed computing. Unfortunately, most systems available to end users rely on a custom description language that is tightly coupled to a specific runtime implementation, which makes it difficult to move applications between systems. To address this problem, we introduce Makeflow, a simple system for expressing and running a data-intensive workflow across multiple execution engines without requiring changes to the application or the workflow description. Makeflow allows any user familiar with basic Unix Make syntax to generate a workflow and run it on one of many supported execution systems. Furthermore, to assess the performance characteristics of the execution engines available to users and to help them select one, we introduce Workbench, a suite of benchmarks designed for analyzing common workflow patterns. We evaluate Workbench on two physical architectures, using a variety of execution engines: the first is a storage cluster with local disks and a slower network, and the second is a high-performance computing cluster with a central parallel filesystem and a fast network. We conclude by demonstrating three applications that use Makeflow to execute data-intensive workloads consisting of thousands of jobs.
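
To make the idea of a Make-style workflow that is independent of its execution engine more concrete, the following Python sketch may help. It is not Makeflow's implementation: the rule text, the function names (parse_rules, schedule, run, dry_run_engine), and the engine interface are all hypothetical, and the parser accepts only the most basic "targets: sources" line followed by a tab-indented command. It illustrates how a single description can be turned into a dependency graph and then handed to whichever engine function the caller supplies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-rule workflow written in basic Make-style syntax.
WORKFLOW = (
    "out.txt: part1.txt part2.txt\n"
    "\tcat part1.txt part2.txt > out.txt\n"
    "part1.txt: input.txt\n"
    "\thead -n 1 input.txt > part1.txt\n"
    "part2.txt: input.txt\n"
    "\ttail -n 1 input.txt > part2.txt\n"
)


def parse_rules(text):
    """Parse 'targets: sources' lines, each followed by one tab-indented command."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("\t"):
            if not rules:
                raise ValueError("command line appears before any rule")
            rules[-1]["command"] = line.strip()
        else:
            targets, _, sources = line.partition(":")
            rules.append({
                "targets": targets.split(),
                "sources": sources.split(),
                "command": "",
            })
    return rules


def schedule(rules):
    """Return rule indices in an order that respects file dependencies."""
    # Map each output file to the rule that produces it.
    producer = {t: i for i, rule in enumerate(rules) for t in rule["targets"]}
    # A rule depends on the producers of its sources; files that no rule
    # produces (such as input.txt) are assumed to exist already.
    graph = {
        i: {producer[s] for s in rule["sources"] if s in producer}
        for i, rule in enumerate(rules)
    }
    return list(TopologicalSorter(graph).static_order())


def run(rules, engine):
    """Submit each command to the supplied engine in dependency order."""
    return [engine(rules[i]["command"]) for i in schedule(rules)]


def dry_run_engine(command):
    """Assumed stand-in for a real execution engine: only reports the command."""
    return "would run: " + command


if __name__ == "__main__":
    for line in run(parse_rules(WORKFLOW), dry_run_engine):
        print(line)
```

Replacing dry_run_engine with a function that submits the command to a local shell or a batch system would leave the workflow text untouched, which is the kind of portability the abstract describes. A real engine would also run independent rules in parallel, which this sequential sketch does not attempt.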

Published In

SWEET '12: Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies
May 2012
58 pages
ISBN: 9781450318761
DOI: 10.1145/2443416

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SWEET 2012

Acceptance Rates

Overall Acceptance Rate 4 of 6 submissions, 67%

Cited By

  • (2025) Reproducible research policies and software/data management in scientific computing journals: a survey, discussion, and perspectives. Frontiers in Computer Science, 6. DOI: 10.3389/fcomp.2024.1491823. Online publication date: 15-Jan-2025.
  • (2024) Using open-science workflow tools to produce SCEC CyberShake physics-based probabilistic seismic hazard models. Frontiers in High Performance Computing, 2. DOI: 10.3389/fhpcp.2024.1360720. Online publication date: 1-May-2024.
  • (2024) TaPS: A Performance Evaluation Suite for Task-based Execution Frameworks. 2024 IEEE 20th International Conference on e-Science (e-Science), pages 1-10. DOI: 10.1109/e-Science62913.2024.10678702. Online publication date: 16-Sep-2024.
  • (2024) Disentangled Orchestration on Cyber Ranges. IEEE Transactions on Dependable and Secure Computing, 21(4):2344-2360. DOI: 10.1109/TDSC.2023.3303888. Online publication date: 1-Jul-2024.
  • (2024) Shepherd: Seamless Integration of Service Workflows into Task-Based Workflows through Log Monitoring. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pages 2080-2087. DOI: 10.1109/SCW63240.2024.00260. Online publication date: 17-Nov-2024.
  • (2024) Ensemble Simulations on Leadership Computing Systems. Proceedings of the SC '24 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, pages 394-401. DOI: 10.1109/SCW63240.2024.00059. Online publication date: 17-Nov-2024.
  • (2024) Reshaping High Energy Physics Applications for Near-Interactive Execution Using TaskVine. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-13. DOI: 10.1109/SC41406.2024.00068. Online publication date: 17-Nov-2024.
  • (2024) Dynamic Resource Management for Elastic Scientific Workflows using PMIx. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 686-695. DOI: 10.1109/IPDPSW63119.2024.00131. Online publication date: 27-May-2024.
  • (2024) RIGOLETTO: A Workflow Definition Language for Hybrid Quantum-Classical Scientific Applications. 2024 26th International Conference on Business Informatics (CBI), pages 40-49. DOI: 10.1109/CBI62504.2024.00015. Online publication date: 9-Sep-2024.
  • (2024) Blueprints for Machine Ethics: A Digital Terrarium for Socio-Ethical Artificial Agent Decisionmaking. IEEE Access, 12:195589-195612. DOI: 10.1109/ACCESS.2024.3519139. Online publication date: 2024.