ABSTRACT
Scientific workflows are increasingly used in High Performance Computing (HPC) environments to manage complex simulations and analyses, often consuming and generating large amounts of data. However, workflow tools have limited support for managing input, output, and intermediate data. The data elements of a workflow are often managed by the user through scripts or other ad hoc mechanisms. Technology advances for future HPC systems are redefining the memory and storage subsystem by introducing additional tiers to improve the I/O performance of data-intensive applications. These architectural changes add complexity to managing data for scientific workflows, so workflow data needs to be managed across the tiered storage system on HPC machines. In this paper, we present the design and implementation of MaDaTS (Managing Data on Tiered Storage for Scientific Workflows), a software architecture that manages data for scientific workflows. We introduce Virtual Data Space (VDS), an abstraction of the data in a workflow that hides the complexities of the underlying storage system while allowing users to control data management strategies. We evaluate the data management strategies with real scientific and synthetic workflows, and demonstrate the capabilities of MaDaTS. Our experiments demonstrate the flexibility, performance, and scalability gains of MaDaTS compared to the traditional approach of managing data in scientific workflows.