MaDaTS: Managing Data on Tiered Storage for Scientific Workflows

Authors:
Devarshi Ghoshal, Lawrence Berkeley National Lab, Berkeley, CA, USA
Lavanya Ramakrishnan, Lawrence Berkeley National Lab, Berkeley, CA, USA

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, June 2017, Pages 41–52, https://doi.org/10.1145/3078597.3078611

Published: 26 June 2017

ABSTRACT

Scientific workflows are increasingly used in High Performance Computing (HPC) environments to manage complex simulations and analyses, often consuming and generating large amounts of data. However, workflow tools have limited support for managing input, output, and intermediate data; users often manage a workflow's data elements through scripts or other ad hoc mechanisms. Technology advances for future HPC systems are redefining the memory and storage subsystem by introducing additional tiers to improve the I/O performance of data-intensive applications. These architectural changes make managing data for scientific workflows more complex, so workflow data needs to be managed across the tiered storage system on HPC machines. In this paper, we present the design and implementation of MaDaTS (Managing Data on Tiered Storage for Scientific Workflows), a software architecture that manages data for scientific workflows. We introduce Virtual Data Space (VDS), an abstraction of the data in a workflow that hides the complexities of the underlying storage system while allowing users to control data management strategies. We evaluate the data management strategies with real scientific and synthetic workflows, and demonstrate the capabilities of MaDaTS. Our experiments demonstrate the flexibility, performance, and scalability gains of MaDaTS compared to the traditional approach of managing data in scientific workflows.
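
The idea of an abstraction that hides storage tiers while leaving the placement policy under user control can be illustrated with a small sketch. This is not the MaDaTS API: the class `VirtualDataSpace`, the tier names, and both strategy functions are hypothetical names chosen for illustration, and the abstract does not say how the real system represents data or strategies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical tier names, ordered fastest to slowest; not taken from the paper.
TIERS = ["burst_buffer", "parallel_fs", "archive"]


@dataclass
class DataObject:
    """A logical workflow data item (input, intermediate, or output)."""
    name: str
    size_gb: float
    kind: str  # "input", "intermediate", or "output"


# A strategy maps a data object to a tier name. Illustrative only.
Strategy = Callable[[DataObject], str]


def keep_intermediates_fast(obj: DataObject) -> str:
    """Example policy: place intermediate data on the fastest tier."""
    return TIERS[0] if obj.kind == "intermediate" else TIERS[1]


def everything_on_parallel_fs(obj: DataObject) -> str:
    """Baseline policy: ignore tiers and use the shared file system."""
    return TIERS[1]


@dataclass
class VirtualDataSpace:
    """Toy stand-in for a VDS: tasks use logical names, never tier paths."""
    strategy: Strategy
    placement: Dict[str, str] = field(default_factory=dict)

    def add(self, obj: DataObject) -> None:
        # The strategy, not the task code, decides where the data lives.
        self.placement[obj.name] = self.strategy(obj)

    def locate(self, name: str) -> str:
        # Resolve a logical name to a tier-qualified location.
        return f"{self.placement[name]}:/{name}"


def plan(strategy: Strategy, objects: List[DataObject]) -> Dict[str, str]:
    """Build a VDS with the given strategy and return name -> location."""
    vds = VirtualDataSpace(strategy)
    for obj in objects:
        vds.add(obj)
    return {obj.name: vds.locate(obj.name) for obj in objects}
```

Swapping `keep_intermediates_fast` for `everything_on_parallel_fs` changes where data lands without touching any code that refers to data by logical name, which is the separation the abstract describes.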

Index Terms

MaDaTS: Managing Data on Tiered Storage for Scientific Workflows
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Software infrastructure
        Middleware
    2. Software system structures
      1. Abstraction, modeling and modularity
      2. Software architectures
        Data flow architectures

Published in
HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
June 2017
254 pages
ISBN:9781450346993
DOI:10.1145/3078597
General Chairs: Howie Huang (George Washington University, USA), Jon Weissman (University of Minnesota, USA)
Program Chairs: Adriana Iamnitchi (University of South Florida, USA), Alexandru Iosup (Vrije Universiteit Amsterdam and Delft University of Technology, NLD)
Copyright © 2017 ACM
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
burst buffer
data management
multi-tiered storage
scientific workflows
Qualifiers
- research-article
Acceptance Rates
HPDC '17 paper acceptance rate: 19 of 100 submissions (19%)
Overall acceptance rate: 166 of 966 submissions (17%)

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

MaDaTS: Managing Data on Tiered Storage for Scientific Workflows

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Programming Abstractions for Managing Workflows on Tiered Storage Systems

Tiered data management system: Accelerating data processing on HPC systems

Persistent Data Staging Services for Data Intensive In-situ Scientific Workflows