Future Generation Computer Systems, Volume 75, October 2017, Pages 423–437

A cross-layer optimized storage system for workflow applications

https://doi.org/10.1016/j.future.2017.02.038

Abstract

This paper proposes using file system custom metadata as a bidirectional communication channel between applications and the storage middleware. This channel can be used to pass hints that enable cross-layer optimizations, an option hindered today by the ossified file-system interface. We study this approach in the context of storage system support for large-scale workflow execution systems: our workflow-optimized storage system (WOSS) exploits application hints to provide per-file optimized operations, and exposes data location to enable location-aware scheduling. We argue that an incremental adoption path for cross-layer optimizations in storage exists, present the system architecture for a workflow-optimized storage system and its integration with a workflow runtime engine, and evaluate this approach using synthetic and real applications over multiple success metrics (application runtime, generated network stress, and energy). Our performance evaluation demonstrates that this design brings sizeable performance gains. On a large-scale cluster (100 nodes), compared to two production-class distributed storage systems (Ceph and GlusterFS), WOSS achieves up to 6× better performance on the synthetic benchmarks and 20%–40% better application-level performance for real applications.

Introduction

Custom metadata features (a.k.a. ‘tagging’) have seen increased adoption in systems that support the storage, management, and analysis of ‘big data’. However, the expected benefits are essentially all realized at the application level, either by using metadata to present richer or differently organized information to users (e.g., enabling better search and navigability [1], [2]) or by implicitly communicating among applications that use the same data items (e.g., to support provenance or inter-application coordination).

Our thesis is that, besides the above uses, custom metadata can be used as a bidirectional communication channel between applications and the storage system, and can thus become the key enabler for cross-layer optimizations that, today, are hindered by an ossified file-system interface.

This communication channel is bidirectional because the cross-layer optimizations it enables rely on information passed in both directions across the storage system interface (i.e., application to storage and storage to application). Possible cross-layer optimizations include the following (a sketch after the list illustrates both directions):

  • (top-down) Applications can use metadata to provide hints to the storage system about their future behavior, such as: per-file access patterns, ideal data placement (e.g., co-usage), predicted file lifetime (i.e., temporary files vs. persistent results), access locality in a distributed setting, desired file replication level, or desired quality of service. These hints can be used to optimize the storage layer.

  • (bottom-up) The storage system can use metadata as a mechanism to expose key attributes of the data items stored. For example, a distributed storage system can provide information about data location, thus enabling location-aware scheduling.
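
Both directions fit within the existing POSIX extended-attribute calls, so no interface change is needed. The following minimal sketch illustrates the idea on Linux (Python 3.3 or later, on a file system with user-xattr support such as ext4); the attribute names in the user.woss.* namespace and their values are hypothetical, chosen for illustration rather than taken from the actual WOSS vocabulary.

    import os
    import tempfile

    # Create a scratch file to tag (in the current directory, which is
    # more likely than /tmp to sit on a file system with user xattrs).
    fd, path = tempfile.mkstemp(dir=".")
    os.close(fd)

    # Top-down: the application tags the file with its expected access
    # pattern before the storage system commits to a block placement.
    os.setxattr(path, "user.woss.access_pattern", b"pipeline")

    # Bottom-up: a scheduler would read back an attribute the storage
    # system maintains (e.g., "user.woss.location") with the same call.
    print(os.getxattr(path, "user.woss.access_pattern").decode())

    os.remove(path)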

The approach we propose has four interrelated advantages: it uses an application-agnostic mechanism, it is incremental, it offers a low cost for experimentation, and it focuses the research community’s effort on a single storage system prototype, saving the considerable development and maintenance effort dedicated, nowadays, to multiple storage systems each targeting a specific workload (e.g., HDFS and PVFS [3]). First, the communication mechanism we propose (simply annotating files with arbitrary key/value pairs) is application-agnostic. Second, our approach enables evolving applications and storage systems independently while maintaining the current interface (e.g., POSIX), and offers an incremental transition path for legacy applications and storage systems: a legacy application will still work without changes (yet will not see performance gains) when deployed over a new storage system that supports cross-layer optimizations. Similarly, a legacy storage system will still support applications that attempt to convey optimization hints, yet it will not offer performance benefits. As storage and applications incrementally add support for passing and reacting to optimization hints, the overall system will see increasing gains. Third, exposing information between different system layers implies tradeoffs between performance and transparency. To date, these tradeoffs have been scarcely explored. We posit that a flexible encoding (key/value pairs) as the information-passing mechanism offers the flexibility needed for low-cost experimentation within this tradeoff space.

The approach we propose falls in the category of ‘guided mechanisms’ (i.e., solutions that let applications influence data placement, layout, and lifecycle), the focus of other projects as well. In effect, the wide range (and incompatibility) of solutions proposed in the storage area over the past two decades, incorporated to some degree by production systems (pNFS, PVFS [3], GPFS [4], Lustre) and by other research projects [5], [6], [7], only highlights that adopting a unifying abstraction is an area of high potential impact. The novelty of this paper comes from the “elegant simplicity” of the solution we propose. First, unlike past work, we maintain the existing API (predominantly POSIX-compatible) and, within this API, we propose using the existing extended file attributes as a flexible, application-agnostic mechanism to pass hints across the application/storage divide. Second, and equally importantly, we propose an extensible storage system architecture that can be augmented with application-specific optimizations.

We demonstrate our approach by building a POSIX-compatible storage system that efficiently supports one application domain: scientific workflows (detailed in Section 2 and Fig. 1). We chose this domain because its community must support a large set of legacy applications developed against the POSIX API. Our storage system is instantiated on-the-fly to aggregate the resources of the computing nodes allocated to a batch application (e.g., disks, SSDs, and memory) and offers a shared file-system abstraction with two key features. First, it optimizes the data layout (e.g., file and block placement, file co-placement) to efficiently support the workflow data access patterns, as hinted by the application. Second, the storage system uses custom metadata to expose data location information so that the workflow runtime engine can make location-aware scheduling decisions. These two features are key to efficiently supporting workflow applications, as the data access patterns they generate are irregular and application-dependent.
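
To make the second feature concrete, the sketch below shows how a workflow runtime engine could consume exposed location metadata when placing a task. The attribute name user.woss.location and its comma-separated node-list encoding are assumptions made for illustration, not the actual WOSS format.

    import os

    def pick_node(task_inputs, candidate_nodes):
        """Prefer the candidate node that stores the most input bytes."""
        score = {node: 0 for node in candidate_nodes}
        for path in task_inputs:
            try:
                # The storage system exposes the nodes holding this file's
                # data; the comma-separated encoding here is hypothetical.
                raw = os.getxattr(path, "user.woss.location")
            except OSError:
                continue  # legacy storage: no location info for this file
            size = os.path.getsize(path)
            for node in raw.decode().split(","):
                if node in score:
                    score[node] += size
        # With no locality information, this degenerates to the first candidate.
        return max(candidate_nodes, key=lambda n: score[n])

A scheduler that cannot read location metadata simply falls back to its default placement, which matches the incremental-adoption argument above.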

Contributions. This project demonstrates that it is feasible to have a POSIX-compatible storage system that can nonetheless be optimized for each application (or application mix), even if the application has a different access pattern for different files. The key contributions of this work are:

  • We propose a new approach that uses custom metadata to enable cross-layer optimizations between applications and the storage system. Further, we argue that this approach can be adopted incrementally. This suggests an evolution path for co-designing POSIX-compatible file systems together with the middleware ecosystem they coexist with, such that performance efficiencies are not lost and flexibility is preserved, a key concern for supporting legacy applications.

  • We present an extensible storage system architecture that supports cross-layer optimizations. We demonstrate the viability of this approach through a storage system prototype optimized for workflow applications (dubbed WOSS, Fig. 1). WOSS supports application-informed data placement based on per-file hints, and exposes data location to enable location-aware task scheduling. Importantly, we demonstrate that it is possible to achieve our goals with only minor changes to the workflow scheduler, and without changing the application code or tasking developers with annotating their code to reveal its data usage patterns.

  • We demonstrate, using synthetic benchmarks as well as three real-world workflows, that this design brings sizeable performance gains. On a large-scale cluster (100 nodes), compared to two production-class distributed storage systems (Ceph [8] and GlusterFS [9]), WOSS achieves up to 6× higher performance for the synthetic benchmarks and 20%–40% application-level performance gains for real applications.

Organization of this paper. The final section of this paper includes a detailed design discussion and design guidelines, discusses the limitations of this approach, and elaborates on the argument that custom metadata can benefit generic storage systems by enabling cross-layer optimizations (Section 5). Before that, we present the context (Section 2), the design (Section 3), and the evaluation (Section 4) of the first storage system we designed in this style: the workflow-optimized storage system (WOSS).

Background and related work

This section starts by briefly setting up the context: the target application domain and the usage scenario. It then continues with a summary of the data access patterns of workflow applications (Section 2.1) and a survey of related work on alleviating the storage bottleneck (Section 2.2).

The application domain: workflow applications. Meta-applications that assemble complex processing workflows using existing applications as their building blocks are increasingly popular in the science domain [10].

System architecture

To efficiently support the targeted usage scenario and the access patterns generated by workflow applications (Section 2.1), the WOSS design needs to provide per-file (or per group of files) runtime configurability to support the diverse data-access patterns different workflow stages may have, and needs to be extensible: that is, to allow defining new optimizations and associating them with new custom attributes. This section presents the system design (Section 3.1) and its integration with the workflow runtime engine.
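
A minimal sketch of this extensibility requirement, assuming the storage layer keeps a registry that maps hint values to block-placement policies (the function names and the two policies below are illustrative, not the WOSS implementation):

    # Registry of placement policies, keyed by the value of a per-file hint.
    PLACEMENT_POLICIES = {}

    def register_policy(hint_value, policy):
        """Adding an optimization is just registering a new (hint, policy) pair."""
        PLACEMENT_POLICIES[hint_value] = policy

    def round_robin(blocks, nodes):
        # Default policy: stripe blocks across all storage nodes.
        return {b: nodes[i % len(nodes)] for i, b in enumerate(blocks)}

    def co_locate(blocks, nodes):
        # Example optimization: keep all blocks on one node so a consumer
        # task can be scheduled next to the data (a pipeline pattern).
        return {b: nodes[0] for b in blocks}

    def place_blocks(hint_value, blocks, nodes):
        """Dispatch placement to the policy named by the file's hint."""
        return PLACEMENT_POLICIES.get(hint_value, round_robin)(blocks, nodes)

    register_policy("pipeline", co_locate)

Per-file configurability then reduces to reading a file’s hint at write time and dispatching through place_blocks; adding a new optimization does not touch the rest of the storage system.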

Evaluation

We have deployed MosaStore on a wealth of different platforms (e.g., from vanilla Linux clusters, to a BlueGene/P machine, to virtualized platforms like Grid5000 and Amazon EC2, within one region and across multiple regions). We have also run a large number of workflow applications. This section summarizes our experience so far; more details are in our technical report [33].

The section is structured as follows: we first present the context of our comparison, then present the performance results.

Discussion and summary

Cross-layer optimizations bypass a restricted, ‘hourglass’, interface between system layers. A classic example is the TCP/IP stack: in the original design, the transport layer assumes that a lost packet indicates congestion and backs off. This assumption is violated in wireless environments and leads to degraded performance. To deal with this situation, a number of mechanisms expose the lower layers’ state and channel capability so that the upper layer can infer the cause of packet loss.

References (48)

  • S.F. Altschul, Basic local alignment search tool, J. Mol. Biol. (1990)
  • J. Koren, et al. Searching and navigating petabyte scale file systems based on facets, in: ACM Petascale Data Storage...
  • A.W. Leung, et al. Spyglass: Fast, scalable metadata search for large-scale storage systems, in: FAST,...
  • P.H. Carns, et al. PVFS: A parallel file system for linux clusters, in: 4th Annual Linux Showcase and Conference,...
  • F. Schmuck, R. Haskin, GPFS: A shared-disk file system for large computing clusters, in: 1st USENIX Conference on File...
  • G. Fedak, H. He, F. Cappello, BitDew: a programmable environment for large-scale data management and distribution, in:...
  • A.C. Arpaci-Dusseau, Semantically-smart disk systems: past, present, and future, SIGMETRICS Perform. Eval. Rev. (2006)
  • J. Schindler, et al. Track-aligned extents: Matching access patterns to disk drive characteristics, in: Conference on...
  • S.A. Weil, S.A. Brandt, E.L. Miller, D.D.E. Long, C. Maltzahn, Ceph: A scalable, high-performance distributed...
  • Trinity/NERSC-8 use-case scenarios, 2013 [cited...
  • J. Wozniak, M. Wilde, Case studies in storage access by loosely coupled petascale applications, in: Petascale Data...
  • T. Shibata, S. Choi, K. Taura, File-access patterns of data-intensive workflow applications and their implications to...
  • S. Bharathi, et al. Characterization of scientific workflows, in: Workshop on Workflows in Support of Large-Scale...
  • U. Yildiz, A. Guabtni, A.H.H. Ngu, Towards scientific workflow patterns, in: Workshop on Workflows in Support of...
  • I. Raicu, I.T. Foster, Y. Zhao, Many-task computing for grids and supercomputers, in: IEEE Workshop on Many-Task...
  • I. Foster, Swift: A language for distributed parallel scripting, J. Parallel Comput. (2011)
  • J. Bent, et al. Explicit control in a batch-aware distributed file system, in: Proceedings of the 1st USENIX Symposium...
  • S. Ghemawat, H. Gobioff, S.-T. Leung, The Google File System, in: 19th ACM Symposium on Operating Systems Principles,...
  • K. Gupta, GPFS-SNC: An enterprise storage framework for virtual-machine clouds, IBM J. Res. Dev. (2011)
  • M. Rosenblum et al., The design and implementation of a log-structured file system, ACM Trans. Comput. Syst. (1992)
  • ROMIO: A high-performance, portable MPI-IO implementation. 2013. Available from:...
  • J.F. Lofstead, et al. Flexible IO and integration for scientific codes through the adaptable IO system (ADIOS), in:...
  • N. Mandagere, J. Diehl, D. Du, GreenStor: Application-aided energy-efficient storage, in: IEEE Conference on Mass...
  • K. Fujimoto, et al. Power-aware proactive storage-tiering management for high-speed tiered-storage systems, in:...

    Samer Al-Kiswany is an Assistant Professor at David R. Cheriton School of Computer Science at the University of Waterloo. He completed his postdoc at the University of Wisconsin–Madison, and his M.Sc. and Ph.D. degrees from the ECE Department at UBC. He is interested in distributed systems with focus on high performance computing systems, and cloud computing.

    Lauro Beltrao Costa joined Google after receiving his Ph.D. from the ECE Department at the University of British Columbia (UBC). Before, he received B.Sc. and M.Sc. degrees from UFCG, Brazil, where he worked on the OurGrid project. He is interested in distributed systems with focus on high-performance systems.

    Hao Yang joined Amazon after receiving his MS from the ECE Department at the University of British Columbia (UBC). He is interested in distributed systems with focus on storage systems.

    Emalayan Vairavanathan joined NetApp after receiving his MS from the ECE Department at the University of British Columbia (UBC). He is interested in distributed systems with focus on storage systems.

    Matei Ripeanu received his Ph.D. degree in Computer Science from the University of Chicago in 2005 before joining the ECE Department of UBC. Matei is interested in distributed systems, focusing on self-organization and decentralized control in large-scale systems. His research group’s work can be found at http://netsyslab.ece.ubc.ca.
