MaDaTS: Managing Data on Tiered Storage for Scientific Workflows

Published: 26 June 2017

ABSTRACT

Scientific workflows are increasingly used in High Performance Computing (HPC) environments to manage complex simulations and analyses that often consume and generate large amounts of data. However, workflow tools provide limited support for managing input, output and intermediate data; the data elements of a workflow are often managed by the user through scripts or other ad hoc mechanisms. Technology advances for future HPC systems are redefining the memory and storage subsystem by introducing additional tiers to improve the I/O performance of data-intensive applications. These architectural changes add complexity to data management for scientific workflows, making it necessary to manage workflow data across the tiered storage system of an HPC machine. In this paper, we present the design and implementation of MaDaTS (Managing Data on Tiered Storage for Scientific Workflows), a software architecture that manages data for scientific workflows. We introduce the Virtual Data Space (VDS), an abstraction of the data in a workflow that hides the complexities of the underlying storage system while allowing users to control data management strategies. We evaluate these data management strategies with real scientific and synthetic workflows, and demonstrate the capabilities of MaDaTS. Our experiments show the flexibility, performance and scalability gains of MaDaTS compared to the traditional approach of managing data in scientific workflows.
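The abstract describes the Virtual Data Space only at a high level. As a hypothetical illustration of the idea (not the actual MaDaTS API, which this page does not show), the sketch below lets a workflow declare its data objects in a virtual namespace while a pluggable policy maps each object to a physical storage tier. The class names, tier capacities and the simple "stage non-persistent data on the burst buffer while capacity allows" policy are all assumptions made for illustration.

```python
# Hypothetical virtual-data-space sketch: workflow data is declared once,
# independent of storage; a policy decides per-object tier placement.
from dataclasses import dataclass


@dataclass
class VirtualDataObject:
    """A workflow data element, described without a physical location."""
    name: str
    size_gb: float
    persist: bool = False  # must outlive the workflow -> durable tier


class VirtualDataSpace:
    """Maps virtual data objects to storage tiers via a simple policy."""

    # Assumed tier capacities in GB (illustrative only).
    TIERS = {"burst_buffer": 50.0, "scratch": 500.0}

    def __init__(self):
        self.objects = {}

    def add(self, vdo: VirtualDataObject) -> None:
        self.objects[vdo.name] = vdo

    def plan(self) -> dict:
        """Place persistent data on scratch; stage intermediates on the
        burst buffer while its capacity allows (largest objects first)."""
        placement, bb_free = {}, self.TIERS["burst_buffer"]
        for vdo in sorted(self.objects.values(), key=lambda v: -v.size_gb):
            if not vdo.persist and vdo.size_gb <= bb_free:
                placement[vdo.name] = "burst_buffer"
                bb_free -= vdo.size_gb
            else:
                placement[vdo.name] = "scratch"
        return placement


vds = VirtualDataSpace()
vds.add(VirtualDataObject("raw_input", 100.0, persist=True))
vds.add(VirtualDataObject("intermediate", 40.0))
vds.add(VirtualDataObject("final_output", 10.0, persist=True))
placement = vds.plan()
print(placement)
```

Because the workflow only names its data and marks what must persist, swapping the placement policy (e.g. a storage-aware versus a workflow-aware strategy) requires no change to the workflow description itself, which is the decoupling the VDS abstraction is intended to provide.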


Published in

HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
June 2017, 254 pages
ISBN: 9781450346993
DOI: 10.1145/3078597

Copyright © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States



Acceptance Rates

HPDC '17 paper acceptance rate: 19 of 100 submissions (19%). Overall acceptance rate: 166 of 966 submissions (17%).
