Journal of Systems and Software

Volume 118, August 2016, Pages 277-287

IKAROS: A scalable I/O framework for high-performance computing systems

https://doi.org/10.1016/j.jss.2016.05.027

Highlights

  • Proposes a dynamically coordinated I/O architecture based on input parameters.

  • Creates, on the fly, dedicated or semi-dedicated clusters of HDDs per job.

  • Provides coordinated parallel data transfers on the overall data flow.

  • Minimizes disk and network contention.

  • Improves performance by 33% using one third of the available hard disks.

Abstract

High performance computing (HPC) has crossed the Petaflop mark and is reaching the Exaflop range quickly. The exascale system is projected to have millions of nodes, with thousands of cores for each node. At such an extreme scale, the substantial amount of concurrency can cause a critical contention issue for the I/O system. This study proposes a dynamically coordinated I/O architecture for addressing some of the limitations that current parallel file systems and storage architectures are facing with very large-scale systems. The fundamental idea is to coordinate I/O accesses according to the topology/profile of the infrastructure, the load metrics, and the I/O demands of each application. Our measurements show that by using the IKAROS approach we can fully utilize the provided I/O and network resources, minimize disk and network contention, and achieve better performance.

Introduction

Large-scale scientific computations tend to stretch the limits of computational power, and parallel computing is generally recognized as the only viable solution to high performance computing problems. I/O has become a bottleneck in application performance as processor speeds skyrocket, leaving storage hardware and software struggling to keep up. Parallel file systems have been developed to allow applications to make optimum use of the available processor parallelism. The most important factors affecting performance are the number of parallel processes participating in the transfers, the size of the individual transfers, and of course the access patterns. I/O access patterns are generally divided into the following subgroups (Schmuck and Haskin, 2002):

  • (1)

    Compulsory I/O (I/Os that must be performed to read a program's initial state from disk and to write the final state back to disk when the program has finished; for example, a program might read a configuration file and perhaps an initial set of data points, and then write out the final set of data points along with graphical and textual representations of the results).

  • (2)

    Checkpoint/restart (used to save the state of a computation in case of a hardware or software error that would require the simulation to be restarted).

  • (3)

    Regular snapshots of the computation's progress.

  • (4)

    Out-of-core reads/writes for problems that do not fit in memory.

  • (5)

    Continuous output of data for visualization and other post-processing.

Another important factor that may significantly affect performance is the architecture of the storage system on which the file system is deployed. Nowadays, a typical HPC facility uses a small portion of the available nodes for storage purposes (I/O nodes acting as storage servers). Normally, each storage server exposes a large number of hard disks through a RAID system. Current globally shared file systems, deployed at such facilities on top of these storage architectures, have several performance limitations when used with large-scale systems, because (Dongarra and Beckman, 2011):

  • (1)

    Bandwidth does not scale economically to large-scale systems.

  • (2)

    I/O traffic on the high speed network can be affected by other unrelated jobs.

  • (3)

    I/O traffic on each storage server can also be affected by other unrelated jobs.

These three problems are generally recognized as the most limiting factors in developing future exascale storage infrastructures. Exascale systems will require I/O bandwidth proportional to their computational capacity, and it seems that current file systems and storage architectures will not be able to fulfill this requirement. One approach is to configure multiple instances of smaller-capacity, higher-bandwidth storage closer to the compute nodes (nearby storage) (Dongarra and Beckman, 2011). These multiple instances can provide exascale-size bandwidth and capacity in aggregate and can avoid much of the impact on other jobs.

This approach does not provide the same file system semantics and functionality as a globally shared file system. In particular, it does not provide file cache coherency or distributed locking, but there are many use cases where those semantics are not required. Other globally shared file system semantics, such as a consistent file name space, are still required and must be provided by a nearby storage infrastructure. In cases where the usage or lifetime of the application data is constrained, a globally shared file system provides more functionality than the application requires while at the same time limiting the bandwidth the application can use. Nearby storage provides more bandwidth, but without offering globally shared file system behavior (Dongarra and Beckman, 2011).

The factors affecting performance multiply if we consider the overall data flow (remote and local access) within an international collaborative scientific experiment, such as the Large Hadron Collider (LHC) (LHC, 2015) at CERN or KM3NeT (KM3NeT, 2015). KM3NeT is a future European deep-sea research infrastructure hosting a new generation of neutrino detectors that, located at the bottom of the Mediterranean Sea, will open a new window on the universe and answer fundamental questions in both particle physics and astrophysics.

Experiments of this kind generate datasets that grow exponentially in both complexity and volume, making their analysis, archival, and sharing one of the grand challenges of the 21st century. Most of these experiments adopt computing models consisting of Tiers (each Tier is made up of several computing centers and provides a specific set of services), and several software packages are utilized for the different steps of data processing (simulation, filtering, calibration, reconstruction, and analysis). The computational requirements are extremely demanding and usually span from serial to multi-parallel or GPU-optimized jobs.

Furthermore, the collaborative nature of these experiments demands very frequent wide area network (WAN) data transfers and data sharing among individuals and groups. Typically, such a computing model utilizes several different computing infrastructures: grids, clouds, HPC systems, data centers, and local computing clusters. The huge diversity of the utilized infrastructures and the lack of proper coordination between the different layers in the overall data flow create a schism between file systems and users (Allen et al., 2012). Remote data access tools, such as NFS, pNFS, and GridFTP, offer a solution in this environment, but these tools fail to provide universal, transparent, and scalable remote data access.

Within the computing model of an international collaborative scientific experiment, category 4 of the I/O access pattern subgroups (out-of-core reads/writes for problems that do not fit in memory) is the most dominant, and writes need to be performed more often than reads. It turns out that the I/O requirements for current, and especially future, collaborative scientific experiments are extremely demanding, and the write operations will not be a trivial issue to master.

To confront these challenges, we introduced IKAROS, a framework that enables us to create ad-hoc nearby storage formations able to use a huge number of I/O nodes to increase the available I/O and network bandwidth (Filippidis et al., 2013, Filippidis et al., 2012). It unifies remote and local access in the overall data flow by permitting direct access to each I/O node, regardless of the tier. In this way we can handle the overall data flow at the network layer, limit the interaction with the operating system, and minimize disk and network contention.

In this study we use the IKAROS approach to provide a dynamically coordinated I/O architecture that addresses the third limitation that current parallel file systems and storage architectures face with very large-scale systems. The fundamental idea is to coordinate I/O accesses according to the topology/profile of the infrastructure, the load metrics, and the I/O demands of each application. We show how IKAROS can be used to extend the nearby storage concept further by creating, on the fly, dedicated or semi-dedicated clusters of HDDs per job. We extend our previous studies by providing an in-depth analysis of the aspects affecting I/O performance in all Tiers of the computing model of an international collaborative scientific experiment. Our main focus is on out-of-core write operations that do not fit in memory, because they will be an extremely challenging issue in next-generation experiments of this type (DOE ASCAC Data Subcommittee Report, 2013).

The paper is organized as follows: Section 2 reviews current file systems and the limitations of using them. Section 3 summarizes the IKAROS design and the basic usage scenarios related to this study. Finally, Section 4 evaluates IKAROS in a small office/home office network attached storage (soho-NAS) environment (Filippidis et al., 2013) and in a high performance computing (HPC) environment.

Section snippets

Related work

Since the 1980s, there have been numerous proposals for shared and parallel file systems, such as the Network File System (NFS) (Stern, 1991), Andrew File System (AFS) (Braam, 1998), General Parallel File System (GPFS) (Schmuck and Haskin, 2002), Parallel Virtual File System (PVFS) (Carns et al., 2000), Lustre (Schwan, 2003), Panasas (Ghemawat et al., 2003), Microsoft's Distributed File System (DFS) (Nagle et al., 2004), GlusterFS (Microsoft Inc. 2011), OneFS (GlusterFS 2011), POHMELFS (Isilon

IKAROS design

IKAROS provides a dynamically coordinated I/O architecture that schedules I/O accesses according to the topology/profile of the infrastructure, the load metrics, and the I/O demands of each application. By referring to the I/O requirements/demands of the application we mean that IKAROS does not use a static/fixed algorithm for data placement. Thanks to the numerous configuration parameters offered, users and applications are able to choose the preferred strategy for each workload. In Fig. 1 we show an
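
To make the parameter-driven, per-job coordination described above more concrete, the following Python sketch shows one possible shape of such a selection step. It is only an illustration under assumed inputs: the node descriptors, the load threshold, and the greedy bandwidth rule are hypothetical and do not reproduce the actual IKAROS placement logic.

# Minimal sketch of per-job, parameter-driven I/O node selection.
# NOT the actual IKAROS algorithm: the node fields, the load threshold, and
# the greedy selection rule below are illustrative assumptions only.

from dataclasses import dataclass


@dataclass
class IONode:
    name: str
    tier: int            # topology/profile: which tier the node belongs to
    load: float          # load metric: current utilization in [0, 1]
    bandwidth_mbps: int  # bandwidth this node can contribute to the job


def select_job_cluster(nodes, required_mbps, preferred_tier, max_load=0.5):
    """Form an on-the-fly (semi-)dedicated set of I/O nodes for one job:
    prefer lightly loaded nodes in the requested tier and stop once their
    aggregate bandwidth covers the job's declared I/O demand."""
    candidates = sorted(
        (n for n in nodes if n.load <= max_load),
        key=lambda n: (n.tier != preferred_tier, n.load),
    )
    chosen, total = [], 0
    for node in candidates:
        chosen.append(node)
        total += node.bandwidth_mbps
        if total >= required_mbps:
            break
    return chosen


if __name__ == "__main__":
    pool = [IONode("n01", 0, 0.2, 100), IONode("n02", 0, 0.7, 100),
            IONode("n03", 1, 0.1, 100), IONode("n04", 0, 0.3, 100)]
    cluster = select_job_cluster(pool, required_mbps=180, preferred_tier=0)
    print([n.name for n in cluster])  # -> ['n01', 'n04'] with these sample nodes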

IKAROS I/O request parameters

The IKAROS syntax follows the generic uniform resource identifier (URI) scheme:

<scheme name>:<hierarchical part>[?<query>]

More specifically, IKAROS requests take the form:

http://hostname:port/ikaros?case&n2&n3&n4&n5&n6

where:

  • case is the module functionality selector (there are four main cases);

  • n2 is the requested file seek point;

  • n3 is the requested buffer size;

  • n4 is the requested chunk size;

  • n5 is the requested number of parallel data transfer channels;

  • n6 is the requested data file.

The
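
As an illustration of the request format above, the following Python sketch assembles an IKAROS-style URI from the listed parameters and prints it. The host name, port, and parameter values are hypothetical examples chosen for the sketch, not values prescribed by IKAROS.

# Illustrative sketch only: builds a request URI of the form
#   http://hostname:port/ikaros?case&n2&n3&n4&n5&n6
# The host, port, and parameter values below are assumed examples.

from urllib.parse import quote


def build_ikaros_request(host, port, case, seek_point, buffer_size,
                         chunk_size, parallel_channels, data_file):
    """Join the positional, ampersand-separated IKAROS query fields."""
    fields = [case, seek_point, buffer_size, chunk_size, parallel_channels,
              quote(data_file)]
    return "http://{}:{}/ikaros?{}".format(
        host, port, "&".join(str(f) for f in fields))


if __name__ == "__main__":
    # Example with assumed values: case selector 1, seek point 0, a 4 MiB
    # buffer, 64 MiB chunks, 4 parallel channels, and file 'dataset.raw'.
    print(build_ikaros_request("io-node-01", 8080, case=1, seek_point=0,
                               buffer_size=4 * 1024 * 1024,
                               chunk_size=64 * 1024 * 1024,
                               parallel_channels=4,
                               data_file="dataset.raw"))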

IKAROS in soho-NAS and HPC environment

The ability of IKAROS to use a huge number of low-specification, low-power-consumption I/O nodes (Filippidis et al., 2013) through a dynamically coordinated I/O architecture, providing I/O accesses according to the topology/profile of the infrastructure, the load metrics, and the I/O demands of each application, allows us to work towards addressing the limitations that current file systems and storage architectures face with large-scale systems, because:

  • (1)

    Bandwidth does not

Conclusions and future work

This study proposes a dynamically coordinated I/O architecture for addressing some of the limitations that current parallel file systems and storage architectures are facing with very large-scale systems. The fundamental idea is to coordinate I/O accesses according to the topology/profile of the infrastructure, the load metrics, and the I/O demands of each application.

By using the IKAROS reverse read technique for write operations, we are able to apply only coordinated parallel data transfers

Acknowledgment

We want to thank Associate Professor Stathes P. Hadjiefthymiades of the University of Athens for his support in further advancing this work.

This work was supported by the Cy-Tera Project (ΝΕΑ ΥΠΟΔΟΜΗ /ΣΤΡΑΤΗ/0308/31), which is co-funded by the European Regional Development Fund and the Republic of Cyprus through the Research Promotion Foundation.


References (43)

  • H. Abbasi et al.

    Just in time: adding value to the IO pipelines of high performance applications with JITStaging

  • H. Abbasi et al.

    DataStager: scalable data staging services for petascale applications

    Cluster Comput.

    (2010)
  • S. Al-Kiswany et al.

    The case for a versatile storage system

  • B. Allen et al.

    Software as a service for data scientists

    Commun. ACM

    (2012)
  • P.J. Braam

    The Coda distributed file system

    Linux Journal

    (1998)
  • P.H. Carns et al.

    PVFS: a parallel file system for Linux clusters

  • Circle: (2012) http://savannah.nongnu.org/projects/circle/. Accessed 4 Sept...
  • CloudStore, (2012) http://code.google.com/p/kosmosfs/. Accessed 4 Sept...
  • DOE ASCAC Data Subcommittee Report, 2013 Synergistic challenges in data-intensive science and exascale computing (March...
  • J. Dongarra et al.

    The international exascale software roadmap

    Int. J. High Perform. Comput. Appl.

    (2011)
  • P. Druschel et al.

    Past: persistent and anonymous storage in a peer-to-peer networking environment

  • C. Filippidis et al.

    Design and implementation of the mobile grid resource management system

    Comput. Sci.

    (2012)
  • C. Filippidis et al.

    IKAROS: an HTTP-based distributed File System, for low consumption and low specification devices

    J. Grid Comput. Springer

    (2013)
  • C. Filippidis et al.

    Forming an ad-hoc nearby storage, based on the IKAROS and social networking services

    IOP J. Phys.: Conf. Ser.

    (2014)
  • S. Ghemawat et al.

    The Google file system

  • GlusterFS, (2011) http://www.gluster.com/. Accessed 3 Sept...
  • Y. Gu et al.

    Distributing the Sloan Digital Sky Survey using UDT and Sector

  • HTCondor. 2016....
  • F. Hupfeld et al.

    XtreemFS—a case for object-based storage in grid data management

  • F. Isaila et al.

    Design and evaluation of multiple-level data staging for blue gene systems

    IEEE Trans. Parallel Distrib. Syst.

    (2011)
  • Isilon Systems, (2012) OneFS. http://www.isilon.com/. Accessed 5 June...
Cited by (3)

  • Combining malleability and I/O control mechanisms to enhance the execution of multiple applications

    2019, Journal of Systems and Software

    Citation Excerpt:

    This system includes elastic partitions that can scale up and down with the number of storage resources. Another solution is IKAROS (Filippidis et al., 2016), that permits the dynamic creation of clusters of storage nodes per job including both local and remote storage resources, based on application characteristics. In the context of cloud computing, some solutions for elasticity in I/O have been proposed, such as SpringFS (Xu et al., 2014) as well as solutions based on the Hadoop Distributed File System (HDFS) (Lim et al., 2010; Cheng et al., 2012).

  • PWebDAV: A Multi-Tier Storage System

    2018, Proceedings - 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018

Christos Filippidis is a PhD candidate at the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens. He holds a B.Sc. in Electronic Engineering and an M.Sc. in Communications Systems & Networks. He is a member of the Large Hadron Collider experiment at CERN, the KM3NeT.org consortium, and a professional member of ACM. His research interests include data transfer & management, high performance computer systems architectures, grid/cloud computing, and computer networks. Web Page: http://cern.ch/filippidis

Prof. Panayiotis Tsanakas, Chairman of GRNET, is currently serving as a professor of computer science in the School of Electrical and Computer Engineering of the National Technical University of Athens. He holds a B.Sc. in Electrical Engineering, and an M.Sc. and Ph.D. in Computer Engineering. He has participated in several national and EU-sponsored projects, in subjects covering e-Infrastructures, distributed systems, parallel computer architectures, and medical informatics. He is currently the Chairman of GRNET, responsible for the deployment and operation of the national R&E network, along with large-scale national HPC and IaaS facilities. His research interests include HPC systems architectures, grid/cloud computing, and distributed applications in medicine.

Yiannis Cotronis is an Associate Professor at the Department of Informatics and Telecommunications, National and Kapodistrian University of Athens. He holds a B.Sc. in Mathematics, and an M.Sc. and Ph.D. in Computing Science. His current research interests include software engineering for parallel programming and e-science applications. He has chaired EuroPVM/MPI 2001, Euromicro PDP 2000, 2011, 2016, and EuroMPI 2011.
